Abstract

Over 725 steganography tools are available on the Internet, each providing a method
for covert transmission of secret messages. This research presents four steganalysis
advancements that result in an algorithm that identifies the steganography tool used to
embed a secret message in a JPEG image file. The algorithm includes feature generation,
feature preprocessing, multi-class classification and classifier fusion. The first
contribution is a new feature generation method based on the decomposition of the
discrete cosine transform (DCT) coefficients used in the JPEG image encoder. The
generated features are better suited to identifying discrepancies in each area of the
decomposed DCT coefficients. Second, classification accuracy is further improved
by a feature ranking technique, applied in the preprocessing stage, for the
kernel Fisher’s discriminant (KFD) and support vector machine (SVM) classifiers in the
kernel space during the training process. Third, for the KFD and SVM two-class
classifiers, a classification tree is designed from the kernel space to provide a multi-class
classification solution for both methods. Fourth, by analyzing a set of classifiers,
signature detectors, and multi-class classification methods, a classifier fusion system is
developed to increase the accuracy of identifying the embedding method used
to generate the steganography images. Based on classifying stego images created with
research and commercial JPEG steganography techniques (the F5, JP Hide, JSteg,
Model-based, Model-based Version 1.2, OutGuess, Steganos, StegHide and UTSA embedding
methods), the system shows a statistically significant increase in
classification accuracy of 5%. In addition, this system provides a solution for identifying
steganographic fingerprints as well as the ability to include future multi-class
classification tools.
MULTI-CLASS CLASSIFICATION FOR IDENTIFYING JPEG
STEGANOGRAPHY EMBEDDING METHOD


I.  Introduction 

Steganography, i.e., any form of covert communication, plays an important role in
information security. Literally, the word steganography, which originates from ancient
Greek, means “covered writing” (The Oxford English Dictionary, 1933). It puts
emphasis on perceptually unobservable/undetectable data hiding, i.e., the inability to prove
that a cover file contains hidden data. The three components used to hide secret
information in steganography are the stego message, the cover file and the embedding method.
The stego message is the covert message that a sender wishes to remain confidential, such as
text, a picture or audio. A clean file is a file that has not been modified from its original
characteristics, while a cover file/carrier is a file in which a message will be hidden.
Given the cover file and the stego message as input, the embedding method produces
stego/dirty files, i.e., digital files that have been manipulated by an embedding method
so that they contain the hidden information. In the embedding and decoding procedures,
a parameter called the stego key, shared by the sender and the receiver, is used to
restrict who is able to extract the stego message from the stego file.

The classic model for steganography, proposed by Simmons (1984), is the prisoners’
problem. Figure 1.1 illustrates a scenario in which Alice and Bob are arrested
for a crime and thrown in two different cells. They want to develop an escape plan, but
the warden Wendy monitors all communications between the two prisoners. She will not
let them communicate through encryption, and if she notices any suspicious
communication she will place them in solitary confinement and thus suppress the
exchange of all messages. Hence, both parties must communicate invisibly in order to
avoid arousing Wendy’s suspicion; they have to set up a subliminal channel. A practical
way to do so is to hide meaningful information in some harmless message: Alice could,
for instance, use a digital photo of an aircraft and send this image to Bob. Wendy has no
idea that the binary representation of the image transmits a secret escape plan
(stego message). After receiving the stego file, Bob reconstructs the message with a key
he shares with Alice.


[Figure 1.1 diagram. Recoverable labels: Warden Wendy Observing Communications between
Alice and Bob; Message; Stego; Decoding Algorithm; Extracted Message; Stego Key Shared
between Alice and Bob.]


Figure  1.1.  Prisoner’s  Problem,  Schematic  of  the  Principles  of  Steganography. 


In contrast to steganography, steganalysis, the main research topic of this dissertation, is
the process of identifying a file that contains steganography and/or extracting the stego
message. Steganalysis has progressed from the simple case of determining whether an image
contains hidden information to the more complex problem of extracting the hidden
information. With over 725 steganography tools available on the Internet (Backbone
Security, 2008), this is an escalating problem. From a digital forensics standpoint, it is
important to extract the hidden data. A step in the process for doing this is identifying the
embedding algorithm used to create the stego file. Stego method identification, however,
is not trivial with so many tools available. A step towards algorithm identification
requires determining the class of steganography algorithm used during embedding. This
identification requires developing a steganalysis system.

This research focuses on building a multi-class steganalysis system for detecting
hidden messages in compressed images. The system first generates features, which are
characteristics of the JPEG images, from the inputs. The generated raw features are sent
through a set of preprocessing steps (feature ranking, feature selection and feature
extraction), which are used to eliminate redundancies within the features. The preprocessed
features are input to SVM or KFD classifiers using the presented multi-class tree, together
with classifier selection and the fusion of classifiers. The performance of the system is
measured by its classification accuracy on input images, i.e., determining whether an image
is clean or stego and, if stego, which of the following JPEG embedding methods was used:
F5, JP Hide, JSteg, Model-based, Model-based Version 1.2, OutGuess, Steganos, StegHide
and UTSA.

In the following section, a background of steganalysis is given, which includes the
definition of steganography, a brief history, comparisons with cryptography and
watermarking, and a definition of steganalysis. Following this, a section is devoted to
the problem statement for a multi-class classification system. The problems
encountered in the development of multi-class systems, such as the generation
of features, selection of the best set of features, classification with classifier selection and
the fusion of multi-class classification methods, are also discussed. The methodology
section gives an overview of the multi-class steganalysis system, including the generation
of features for JPEG images, the multi-class tree structure for classification, selection of
the most relevant features and the multi-class classification fusion system. The last
section concludes with a summary of the topics discussed within this chapter.
1.1 Background


In  this  section  steganography,  one  of  the  information  hiding  techniques,  is  defined  with 
respect  to  current  multimedia  formats.  A  brief  history  of  steganography  (Kahn,  1996)  is 
given  beginning  with  ancient  Greece  (Littlebury,  1737a;  1737b;  Rawlinson,  1862;  1875; 
1880;  1889)  to  the  current  classical  model.  In  addition,  a  comparison  is  made  between 
steganography,  cryptography  and  watermarking,  which  are  the  other  data  hiding 
techniques,  followed  by  a  definition  of  steganalysis. 

1.1.1  Introduction  to  Steganography 

Communication systems have long been used to send and receive secret messages. In
many of these systems the messages may be transmitted through public communication
channels, either open to view or concealed from an outside observer. Stego messages
are those that have been hidden within innocent-looking cover files, creating a stego
file. Even though data hiding terminology is fairly modern, owing to the popularity of
multimedia, the roots of steganography can be traced back to ancient Greece (Littlebury,
1737a; 1737b; Rawlinson, 1862; 1875; 1880; 1889). A history of steganography
written by Kahn (1996) describes specific steganography events. Herodotus, the father of
history, gives several cases (Littlebury, 1737a; 1737b; Rawlinson, 1862; 1875; 1880;
1889). A man named Harpagus wanted to send a secret message, so he killed a hare and
hid a message inside its body. He sent it with a messenger who pretended to be a hunter
(Littlebury, 1737a, pp. 80-81; Rawlinson, 1889, p. 201). In another instance (Littlebury,
1737a, p. 19; Rawlinson, 1862, p. 197), Histiaeus wished to inform his friends that it
was time to begin a revolt against the Medes and the Persians. He shaved the head of one
of his trusted slaves, tattooed the message on the head, waited till the hair grew back, and
sent him along. It worked; the message successfully reached its intended recipients in
Persia and the revolt succeeded. Things worked more slowly in the days before faxes,
e-mail and the Internet. Herodotus also tells of a man named Demaratus who wanted to
report from the Persian court back to his friends in Greece that Xerxes the Great was
about to invade Greece (Littlebury, 1737b, pp. 278-279; Rawlinson, 1880, p. 187).
Messages in those days were sent via writing tablets made of two pieces of wood, hinged
as a book, with each face covered with wax. One wrote on the wax; the recipient melted
the wax and reused the tablet. Demaratus removed the wax from the tablet, wrote a
message on the wood itself and re-covered the tablet with wax. He then sent the
apparently blank tablets to Greece. At first nobody could figure out what they meant.
Then a woman named Gorgo guessed that the wax might be concealing something. She
removed it and became the first woman cryptanalyst (Kahn, 1996). Unfortunately, her
ingenuity had fatal consequences for her husband Leonidas, the king of Sparta; he died
with a band of Greeks holding off the Persians at Thermopylae (Littlebury, 1737,
pp. 270-271; Rawlinson, 1880, p. 178).

1.1.2  Difference  between  Steganography  and  Cryptography 

An alternative method to steganography in secure communication is cryptography. An
important point to note is that both steganography and cryptography provide secure
communications and may be used concurrently. Steganography and cryptography differ
in execution. In cryptography, the secret message, which is the transmitted file itself,
cannot be recovered without the secret key; however, an observer can see that an
encrypted file is being sent. Cryptography helps to protect confidentiality, but the
protection vanishes after decryption. In steganography, the existence of the stego message
is concealed in a cover file in a way that does not allow an enemy to observe that there is
a message present (Petitcolas et al., 1999). The stego message can be extracted with the
stego key once the embedding method used to create the stego file has been identified.
1.1.3 Differences between Steganography and Watermarking


Besides steganography, watermarking is the other data hiding technique, broadly
used for authentication. The two share many common rules, but their objectives
are different. In watermarking, the important information is the cover
media. The embedded data is inserted solely for the protection of the cover media. In
steganography, the cover media is not important. It typically serves as a diversion from
the embedded data. Steganographic communications are usually between a sender and a
single receiver, while watermarking techniques are usually between a sender and many
receivers (Katzenbeisser and Petitcolas, 2000). Digital watermarking may be thought of
as a commercial application of steganography, being used to trace, identify and locate
digital media across networks (Johnson and Jajodia, 1998A; 1998B). The
encoding/decoding part of steganographic systems is similar to watermarking. However,
steganography has reduced robustness requirements, allowing a higher embedding rate.

1.1.4  Steganalysis 

Steganalysis is the science of detecting hidden information within a cover file, i.e.,
identifying a file as containing stego and/or extracting the stego message. An investigator
using steganalysis techniques, such as Wendy in the prisoners’ problem, is known as a
steganalyst. Steganalysis is a relatively young research discipline, with few articles
appearing before the late 1990s (Kessler, 2004). The science of steganalysis was initially
intended to detect or estimate the existence of stego information based on observing some
data transfer, while making no assumptions about the steganography algorithm applied
(Chandramouli, 2002). In digital image steganalysis an analyst has three goals: first,
determine whether an embedded message exists; next, determine the embedding method used
to create the stego image; and finally, extract the hidden message. This research focuses on
the second goal, that is, identifying the embedding technique used to create the
steganography image. Several detection systems currently exist, so the identification
problem becomes one of determining which detection system has correctly identified the
embedding method. Most steganalysis today is signature-based, similar to anti-virus and
intrusion detection systems. In this type of application, known embedding algorithms
provide fingerprints that are added to a steganographic fingerprint database: the
analyst creates a message and uses a known stego tool to create a stego file. This known
stego file is then analyzed to determine patterns for later use against other stego files
(Silman, 2001). Steganography detection and extraction is generally sufficient if the
purpose is evidence gathering related to a past crime. However, disabling the hidden
message so that the recipient cannot extract it, and/or altering the hidden message to send
misinformation to the recipient, might also be legitimate law enforcement goals during an
ongoing investigation of criminal or terrorist groups (Jackson, 2003). The law
enforcement community does not always have the luxury of knowing when and where
steganography has been used or which algorithm has been employed. Generic tools
capable of detecting and classifying steganography are emerging from current research and
becoming available, including research prototypes (Fridrich, 2004;
Lyu and Farid, 2004; Shi et al., 2005; Pevny and Fridrich, 2007; Rodriguez and Peterson,
2007; Wang and Moulin, 2007) and commercially available tools (e.g., ILook
Investigator, Inforenz Forager, StegalyzerSS, SecureStego, StegDetect (Provos, 2004)
and WetStone’s Stego Suite).

The  following  definitions  were  introduced  by  Johnson  and  Jajodia  (1998B)  and  are 
frequently  used  by  the  steganalysis  community: 

•  Stego-only  attack:  The  stego  file  is  the  only  item  available  for  analysis. 

•  Known  cover  attack:  The  cover  and  stego  file  are  both  available  for  analysis. 

•  Known  message  attack:  The  hidden  message  is  known. 

•  Chosen  stego  attack:  The  stego  file  and  tool  are  both  known. 

•  Chosen  stego  message  attack:  The  steganalyst  generates  stego  files  from  a  known 
steganography  tool  using  a  chosen  stego  message. 

•  Known  stego  attack:  The  cover  file,  stego  file  and  stego  tool  are  known. 
1.2 Problem Statement


There are an estimated 725 steganography methods available on the Internet, with the
majority being used for hiding messages in digital images (Backbone Security, 2008).
Several are downloadable for free and have user-friendly graphical user interfaces (GUIs)
(Higgins, 2007). While these tools have been used to hide various forms of information
for privacy, they have also been used for criminal activity and with malicious intent.
Documented examples include an incident in which an engineer
sent an email with two attached images that turned out to be a set of stego files
containing intellectual property (Radcliff, 2002). Other crimes involving the use of
steganography include child pornography, where the stego files are used to hide a
predator’s location when posting digital pictures on Web sites or sending them through
email (Astrowsky, 2000). Steganography may also be used to allow communication
between affiliates of an underground community, such as terrorist organizations (Kelley,
2001). To combat these image stego tools, the initial step is to determine whether an
observed image contains a stego message. If an image is identified as being a stego file,
the second step is to determine the embedding method. Identifying the
steganography method enables the steganalyst to then target that method
and extract the hidden information in a final step.

Identifying the tool used to create the stego image will help in the process of
extracting the hidden message. Therefore, a system must be designed to identify which
stego tool was used. Several factors must be addressed in the steganalysis multi-class
classification system, including feature generation, feature improvement, classifier
selection and fusion, as shown in Figure 1.2.
Figure 1.2. Steganalysis Classification System in Training Stage.


In the training stage, given clean and steganographic image datasets, the system with
all these procedures is trained to find suitable parameters for classification.
The trained classification model, which is the output of this stage, contains the parameters
for feature improvement, the classifier parameters, and the parameters for classifier fusion.
Once the model is set, the testing stage shown in Figure 1.3 outputs which
stego method was used: none, F5, JP Hide, JSteg, Model-based, Model-based
Version 1.2, OutGuess, Steganos, StegHide or UTSA.


Several detection systems are available, ranging from research tools (Lyu and Farid, 2002;
2004; Fridrich, 2004; Lie and Lin, 2005; Shi et al., 2005; Xuan et al., 2005; Fu et al., 2006;
Pevny and Fridrich, 2006; 2007; Rodriguez and Peterson, 2007; Wang and Moulin, 2007)
to commercially available systems (ILook Investigator © toolsets, Inforenz Forager®,
SecureStego (Air Force Research Laboratory, Rome, NY), StegDetect (Provos, 2004),
WetStone Stego Suite™). Each of the available systems has certain advantages over the
others. A steganalyst should use as many of these tools as possible when analyzing a set of
images. A problem arises when the detection systems return different
class labels representing different embedding techniques. When each
detection system identifies a different stego tool, the analyst must
determine the correct method from the set of identified stego labels. The
solution described in this research fuses the results of the detection systems to obtain better
detection accuracy and to relieve the steganalyst of having to make this assessment.
The remainder of this section introduces the basic concepts of feature generation, feature
improvement, classifier selection, multi-class classification and the fusion of classifier
systems.

1.2.1  Feature  Generation 

The basic concept of generating features is to transform a given image, which contains an
extensive number of data values in a two-dimensional matrix, into a new set of features.
If the transform is suitably chosen, the transform domain features can retain much of the
information about the original input image in a compact vector form. This
means that most of the classification-related information is compressed into a relatively
small number of values, leading to a reduced feature space (Theodoridis and
Koutroumbas, 2006). For example, consider a grayscale image of 512x512 pixels.
This image contains 262,144 pixel values; mapping the image into a new domain
with a transform can potentially represent the image with a significantly
smaller number of values. The basic reasoning behind transform-based features is that an
appropriately chosen transform can exploit and remove redundancies that usually exist in
digital images (Theodoridis and Koutroumbas, 2006). In the steganalysis problem,
an input image that has been manipulated by an embedding method will
contain changes that are not visible to the human eye. JPEG images use a
compression technique based on the discrete cosine transform (DCT).
Generating features for discriminating between a clean image (an original cover file) and
a stego image (stego file) using the DCT will eliminate redundant pixel information.
When generating features derived from the DCT, most of the energy lies in
the frequency bands of the coefficients, which provide important information for class
discrimination. This, however, leads to a large number of features, which must be
reduced to maintain classification accuracy.
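
The compaction described above can be illustrated with a minimal sketch. It assumes a
grayscale image stored as a list of rows and computes the 8x8 block DCT with the same
scaling as the JPEG forward DCT, omitting JPEG's level shift and quantization. The
function names and the choice of feature (mean absolute coefficient per frequency
position, giving 64 values regardless of image size) are illustrative assumptions and are
not the feature set developed in this research.

```python
import math


def dct2_block(block):
    """Orthonormal 2-D DCT (type II) of an N x N block given as a list of rows."""
    n = len(block)

    def alpha(k):
        # Scaling factors; for N = 8 these match the JPEG forward DCT.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def block_dct_features(image, block=8):
    """Mean absolute DCT coefficient at each frequency position over all full blocks."""
    rows, cols = len(image), len(image[0])
    sums = [[0.0] * block for _ in range(block)]
    count = 0
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            tile = [row[c:c + block] for row in image[r:r + block]]
            coeffs = dct2_block(tile)
            for u in range(block):
                for v in range(block):
                    sums[u][v] += abs(coeffs[u][v])
            count += 1
    return [sums[u][v] / count for u in range(block) for v in range(block)]


# A small constant image: all energy ends up in the DC position (index 0).
features = block_dct_features([[10.0] * 16 for _ in range(16)])
print(len(features), round(features[0], 6))
```

With this illustrative feature, a 512x512 image (262,144 pixel values) would also be
summarized by 64 values; a practical feature set would normally be richer than this.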

1.2.2  Feature  Improvement 

Improving the raw features before classification is vital. The goal of
feature improvement is to select a subset of features and/or extract the
features best able to categorize the inputs.

Feature Ranking and Selection - The major task in feature selection, given a large
number of features, is to select the most important features and reduce the dimensionality
while retaining class discriminatory information. This procedure is important when
determining which features are to be used to train the classification model. If features
with little discriminatory power are selected, the subsequent classification model will
perform poorly. On the other hand, if information-rich features are
selected, the design of the classifier can be greatly simplified. In more quantitative
terms, feature selection should lead to large between-class distances and small
within-class distances in the feature space. That is, features should separate different
classes by a large distance, and should have small distances between objects in the same
class. Several methods are available to identify individual features with linear separation.
A few ranking and selection methods include the divergence measure (Fukunaga, 1990;
Theodoridis and Koutroumbas, 2006), the Bhattacharyya distance (Bhattacharyya, 1943;
Fukunaga, 1990) and Fisher’s linear discriminant ratio (Fisher, 1936; 1943; Dillon and
Goldstein, 1984; Fukunaga, 1990; van der Heijden et al., 2004; Bishop, 1995; 2006;
Theodoridis and Koutroumbas, 2006).
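
As a hedged illustration of ranking individual features by linear separability, the sketch
below scores each feature of a two-class problem with Fisher’s linear discriminant ratio,
taken here as the squared difference of the class means divided by the sum of the class
variances. The function names and the toy data are assumptions made for this example; the
kernel-space ranking developed in this research is a different technique.

```python
from statistics import mean, pvariance


def fisher_ratio(a, b):
    """Fisher's discriminant ratio for one feature observed in two classes."""
    spread = pvariance(a) + pvariance(b)
    gap = (mean(a) - mean(b)) ** 2
    if spread == 0.0:
        # Degenerate case: no within-class scatter at all.
        return float("inf") if gap > 0.0 else 0.0
    return gap / spread


def rank_features(clean, stego):
    """Return (feature index, score) pairs, most discriminative feature first."""
    scores = []
    for j in range(len(clean[0])):
        a = [row[j] for row in clean]
        b = [row[j] for row in stego]
        scores.append((j, fisher_ratio(a, b)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: feature 0 separates the classes well, feature 1 barely at all.
clean = [[0.0, 5.0], [1.0, 6.0], [0.0, 7.0], [1.0, 4.0]]
stego = [[10.0, 5.5], [11.0, 6.5], [10.0, 4.5], [11.0, 7.5]]
print(rank_features(clean, stego))
```

A large score corresponds to a large between-class distance relative to the within-class
spread, which is the property described in the paragraph above.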

When measuring nonlinear class separability, care must be taken when using feature
ranking methods. Ranking methods developed for specific classifiers are often best suited
for determining the best set of ranked features. For neural network classifiers, features are
ranked and selected based on a saliency metric (Ruck et al., 1990; Belue and Bauer,
1995) and signal-to-noise ratio (Bauer et al., 2000). For kernel-based classifiers, such as
kernel Fisher’s discriminant and support vector machines, method-specific techniques are
best suited for ranking. These techniques include recursive feature elimination (Guyon et
al., 2002; Guyon, 2007), zero-norm feature ranking (Weston et al., 2003), gradient
calculations using recursive feature elimination (Rakotomamonjy, 2003), and kernel
Fisher’s discriminant using recursive feature elimination (Louw and Steel, 2006).

Feature Extraction - Another approach to reducing the dimension of the input features
is to use a transformed space instead of the original feature space. For example, using a
transformation $\Phi(\cdot)$ that maps the data points x of the input space, $\mathbb{R}^n$,
into a reduced dimensional space, $\mathbb{R}^p$, where n > p, creates features in a new
space that may have better discriminatory properties. Classification is based on the new
feature space rather than the input feature space. The advantage of feature extraction over
feature selection is that no information from any of the elements of the measurement
vector is removed. In some situations feature extraction is easier than feature selection.
A disadvantage of feature extraction is that it requires the determination of a suitable
transformation $\Phi(\cdot)$. Some methods include principal component analysis
(Hotelling, 1933; Dillon and Goldstein, 1984) and kernel principal component analysis
(Scholkopf et al., 1998; Bishop, 2006). If
the transformation chosen is too complex, the ability to generalize from a small data set
will be poor. On the other hand, if the transformation chosen is too simple, it may
constrain the decision boundaries to a form that is inappropriate to discriminate between
classes. Another disadvantage is that all features are used, even if some of them have
noise-like characteristics. This might be unnecessarily expensive in terms of computation
(van der Heijden et al., 2004). It should be noted that the transformation used for the
input features in the training of the classification model should also be used for the
testing features.
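
The requirement that the training transformation also be applied to the testing features
can be sketched as follows. This is a minimal illustration only: it estimates the mean and
the leading principal direction from training data by power iteration, then projects both
training and testing samples with those same parameters, mapping a two-dimensional input
to one dimension. The function names are hypothetical, and the power iteration assumes the
start vector is not orthogonal to the leading direction; a production system would use an
established linear algebra routine.

```python
import math


def fit_pca_1d(samples, iters=200):
    """Estimate the mean and leading principal direction from training samples."""
    n, d = len(samples), len(samples[0])
    mu = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - mu[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    # Arbitrary non-zero start vector (assumed not orthogonal to the answer).
    w = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * w[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; choose another start vector")
        w = [x / norm for x in w]
    return mu, w


def transform(samples, mu, w):
    """Project samples onto the learned direction using the training mean."""
    return [sum((row[j] - mu[j]) * w[j] for j in range(len(mu))) for row in samples]


train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
test = [[4.0, 4.0]]
mu, w = fit_pca_1d(train)          # parameters come from training data only
train_reduced = transform(train, mu, w)
test_reduced = transform(test, mu, w)  # the same mu and w are reused for testing
print(train_reduced, test_reduced)
```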

1.2.3  Classification 

Given an input sample, the training of a classification model may consist of supervised or
unsupervised learning. In supervised learning the input sample includes an identification
of its class membership. In unsupervised learning the class of the input sample is not
known (Jain et al., 2000). This research concentrates on supervised learning. Supervised
learning can be further broken down into subcategories of classification models. These
models include, but are not limited to, the following classifier types (Duda and Hart, 1973;
Fukunaga, 1990; Theodoridis and Koutroumbas, 2006):

•  Classifiers based on Bayes decision theory include: Bayesian networks,
discriminant functions, and mixture models, specifically, expectation
maximization.

•  Linear classifiers include: Bayes linear classifier, Fisher’s linear
discriminant, and the perceptron algorithm.

•  Nonlinear classifiers include: decision trees, kernel Fisher’s discriminant,
multi-layer perceptron, radial basis neural networks, and nonlinear support vector
machines.

•  Nonparametric classifiers include: locally weighted regression and Parzen
window.

These classifiers are predominantly two-class classifiers, while some can be either
two-class or multi-class classifiers. In this research the concentration is on multi-class
classification. The specific problem addressed is how to design discriminant functions
which are able to separate more than two classes (Duda and Hart, 1973; Platt et al., 2000;
Schwenker, 2000; Tax and Duin, 2002; Rifkin and Klautau, 2004; Eibl and Pfeiffer,
2005; Wang and Casasent, 2005; Liu and Zheng, 2005; Bishop, 2006; Middelmann et al.,
2006; Theodoridis and Koutroumbas, 2006; Yang et al., 2006).
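
One generic way to build a multi-class decision from two-class discriminant functions is
the one-versus-rest scheme, sketched below under stated assumptions. The two-class model
is a simple difference-of-class-means linear discriminant chosen only to keep the example
self-contained; it stands in for a KFD or SVM classifier and is not the classification
tree designed in this research. All names and the toy data are hypothetical, and the raw
scores of different models are compared directly, which is a simplification.

```python
def train_two_class(pos, neg):
    """Linear discriminant: project onto the difference of the two class means."""
    d = len(pos[0])
    mp = [sum(r[j] for r in pos) / len(pos) for j in range(d)]
    mn = [sum(r[j] for r in neg) / len(neg) for j in range(d)]
    w = [mp[j] - mn[j] for j in range(d)]
    # Bias places the decision boundary at the midpoint between the means.
    b = -sum(w[j] * (mp[j] + mn[j]) / 2.0 for j in range(d))
    return w, b


def score(model, x):
    """Signed discriminant value; positive favours the model's own class."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_one_vs_rest(data):
    """data maps each class label to its feature vectors; one model per label."""
    models = {}
    for label, pos in data.items():
        neg = [x for other, rows in data.items() if other != label for x in rows]
        models[label] = train_two_class(pos, neg)
    return models


def predict(models, x):
    """Assign the label whose two-class model gives the largest score."""
    return max(sorted(models), key=lambda label: score(models[label], x))


data = {
    "clean": [[0.0, 0.0], [0.0, 1.0]],
    "F5": [[5.0, 0.0], [5.0, 1.0]],
    "JSteg": [[0.0, 5.0], [1.0, 5.0]],
}
models = train_one_vs_rest(data)
print(predict(models, [5.0, 0.5]), predict(models, [0.2, 0.3]))
```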

1.2.4  Classifier  Fusion 

As noted, there is a large pool of different classifiers. In the literature, classifier fusion
has been proposed for improving classification performance by exploiting the individual
advantages of each of the classifiers (Woods et al., 1997; Duin and Tax, 2000; Ruta and
Gabrys, 2001; Shipp and Kuncheva, 2002; Jaeger, 2004; Kuncheva, 2004; Leap et al.,
2004; Theodoridis and Koutroumbas, 2006). The success of classifier fusion depends on
two factors defined by Goebel and Yan (2004): first, the proper selection of a pool of
diverse individual classifiers to be fused, and second, the proper method of fusing
the individual classifiers. A third factor should also be considered, namely the subspace of
the classifiers being fused. Identifying the appropriate classifier for a particular problem is
not trivial. Selecting the single best-performing classifier on the training data and
applying it to the testing data is the easiest method. While this approach is the simplest,
it does not guarantee the best performance. An increase in performance
can possibly be obtained by enlarging the available dataset. When this is not an option,
the most reliable strategy is to evaluate as many different classifier designs as possible
and subsequently select the best-performing model. The difficulty is that such a wide
evaluation is computationally expensive. In relation to classifier fusion, selection
determines which classifiers, and how many, to fuse in order to obtain
increased performance. In certain situations, a problem arises when the outputs of the
individual classifiers are of different types, either discrete values or posterior
probabilities. Hence, the proper classifier fusion technique has to be used for a specific
problem.
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1.3  Methodology 


The  following  sections  describe  the  multi-class  JPEG  image  steganalysis  system.  Each 
subsection  introduces  the  main  advancements  this  research  provides  in  the  area  of 
steganalysis,  specifically,  feature  generation,  feature  selection,  classifier  selection,  and 
multi-class  classifier  fusion. 

1.3.1  DCT  Feature  Generation 

This  section  describes  the  novel  JPEG  steganalysis  feature  generation  method  used  in  the 
classification  system.  In  this  method  the  DCT  coefficients  are  separated  into  vertical, 
diagonal  and  horizontal  orientation  as  well  as  low,  medium  and  high  frequencies.  This  is 
known as DCT decomposition (Rao and Yip, 1990). Each of the 8x8 blocks is divided
into  nine  DCT  decompositions  represented  by  both  the  frequency  distributions  and 
directions.  The  coefficients  of  interest  are  within  the  vertical,  diagonal  and  horizontal 
orientation  of  the  low  and  medium  frequency  bands.  The  predictors  are  used  to  estimate 
modifications  made  to  an  image  by  an  embedding  method.  In  this  research  four  different 
predictor  methods  are  used.  The  first  is  a  distance  measure  in  which  the  distance  between 
neighboring  coefficients  is  calculated  and  averaged.  The  second  method  used  to  calculate 
the  predictors  is  a  least  squares  linear  regression  technique  on  the  DCT  neighboring 
coefficients  for  JPEG  images  originally  proposed  by  Farid  (2002).  In  the  final  method  the 
predictors  are  calculated  by  shifting  the  8x8  blocks  by  one  pixel  in  the  spatial  domain 
followed  by  recompressing  the  pixels  using  the  JPEG  properties.  To  measure  the 
coefficients,  neighboring  coefficients  and  shifted  coefficients,  180  features  are  generated 
from  higher-order  statistics  that  aid  in  the  assessment  of  changes  made  to  the  image  by  an 
embedding  method.  As  more  features  are  created,  the  problem  becomes  one  of  relevancy 
to  the  actual  classification  problem. 

1.3.2 Feature Improvement


The feature ranking/selection method used for improving identification accuracy is
designed for two kernel-based classifiers, the kernel Fisher’s discriminant (KFD) and the
support vector machine (SVM). The benefit of this feature selection method is that the
classification algorithms being used assist in discriminating between important features
and noise features by ranking features in the kernel space. The ranking method consists
of three steps: first, removing one feature at a time from the input space and transforming the
remaining features into the kernel space; second, identifying the alpha vectors and
support vectors; and third, assigning a ranking value to the removed feature using the
alpha vectors and support vectors with a newly derived ranking measure. The selection is
based on the percentage of features necessary to increase classification performance, and
is termed SVM-Kernel Feature Ranking (SVM-KFR). This, however, does not resolve the
need to discriminate between several classes.
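
The leave-one-feature-out loop described above can be illustrated with a small sketch. It is not the SVM-KFR measure itself: the ranking measure derived from the alpha vectors and support vectors is not reproduced here, so a hypothetical two-class separability score (`separability`, computed in the input space rather than the kernel space) is substituted purely to show the control flow. All function names are assumptions made for this illustration.

```python
from statistics import mean, pvariance

def separability(rows, labels):
    """Hypothetical stand-in score: summed Fisher-style ratio over features."""
    total = 0.0
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        spread = pvariance(a) + pvariance(b)
        total += (mean(a) - mean(b)) ** 2 / (spread + 1e-12)
    return total

def rank_features(rows, labels, score=separability):
    """Rank features by how much the score drops when each one is removed."""
    base = score(rows, labels)
    drops = []
    for j in range(len(rows[0])):
        reduced = [r[:j] + r[j + 1:] for r in rows]  # remove feature j only
        drops.append((base - score(reduced, labels), j))
    # Largest drop first: removing that feature hurts the score the most.
    return [j for _, j in sorted(drops, reverse=True)]

def select_top(ranking, fraction):
    """Keep the top fraction of the ranked features (at least one)."""
    keep = max(1, round(fraction * len(ranking)))
    return ranking[:keep]

# Toy data: feature 0 separates the two classes, feature 1 is mostly noise.
rows = [[0.0, 5.1], [0.2, 4.9], [1.0, 5.0], [1.2, 5.2]]
labels = [0, 0, 1, 1]
ranking = rank_features(rows, labels)
```

In this toy example the informative feature is ranked first, and `select_top(ranking, 0.5)` keeps only that feature.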

1.3.3  Classification 

For detecting stego messages produced by various embedding methods, a fusion of classifiers is
used to increase classification accuracy. Prior to the fusion process, the selection of
classifiers is vital. One approach is to heuristically pick the number and types of
classifiers while ensuring diverse outputs. Another approach is to choose classifiers from
a large pool to achieve a classification error rate as close to zero as
possible. This should be accomplished while avoiding the exhaustive evaluation of all
possible classifier combinations. The classifiers are multi-class classifiers, including
the Bayes decision theory method and expectation maximization (EM); the nonlinear
classifier, probabilistic neural network (PNN); and the nonparametric classifiers, k-nearest
neighbors and Parzen windows. Two nonlinear kernel-based methods are also used, the
support vector machine (SVM) and kernel Fisher’s discriminant (KFD). These two
methods, however, are two-class classifiers. In this methodology, the focus is on
multi-class classification for identifying various stego embedding methods. To
extend the two-class KFD and SVM to this problem, a new multi-class classification tree is
designed specifically for the KFD and SVM, in which a two-class classifier resides at each
node of the tree. The tree is designed by separating classes into two groups at each node.
The classes are grouped according to the smallest distances between classes. The tree is
gradually expanded by adding a node each time a set of two or more classes is identified.
A small distance between classes indicates that they are difficult to separate, that is, low
classification accuracy. The distance measure is based on the kernel transform.

1.3.4  Classifier  Fusion 

The output labels of the multi-class classifiers, expectation maximization (EM), k-nearest
neighbors (k-NN), Parzen window and probabilistic neural networks, along with the
output labels of the new KFD and SVM multi-class classifiers, are fused to increase
classification accuracy. Along with the six multi-class detection systems, two commercial
tools, StegAlyzerSS and StegoSuite, are also fused. In this work, the individual detection
systems are fused using three fusion methods: the first is
boosting, specifically AdaBoost (Freund and Schapire, 1995); the second is
Bayesian networks for model averaging (Murphy, 2001); and the third is probabilistic
neural networks.

1.3.5  Results 

The methodology is evaluated using 5-fold cross validation, with both
training and testing. With feature preprocessing, the average classification
accuracy of the individual multi-class classifiers (EM, k-nearest neighbors,
Parzen window and PNN) increases by as much as 22% in comparison to no feature preprocessing. A
multi-class classification system for KFD and SVM is created by using a multi-class tree.
When feature preprocessing is applied in the individual nodes of the tree, the classification
accuracy of this new system increases by 10% over the same system without feature preprocessing. With
classifier fusion, the overall accuracy increases by 5% over the best individual classifier.
Furthermore, there is a statistically significant difference in performance
between the newly fused system and the individual detection systems.

1.4  Summary 

This chapter defined steganography, provided a brief history, and described how steganography is
used with current multimedia formats. A definition of steganalysis was also
given, followed by a section devoted to the problem statement for a multi-class
classification system. The specific problems encountered in the development of multi-class
systems discussed in this chapter are the generation of features, the selection of the best set of
features, classifier selection and the fusion of multi-class classification methods. The
methodology for this research was introduced in Section 1.3, which included the
generation of features for identifying JPEG stego and clean images, the selection of the most
relevant features, the design of a multi-class classification system for both KFD and SVM,
and the fusion of multi-class classifiers.

Chapter 2 provides the necessary background and literature review for solving the
complex problem of identifying the embedding method used. In Chapter 3, the
methodology is described in detail and the full detection system is developed. This
involves the generation of features, the ranking and selection of features, the design of the
classification tree and the fusion of classifiers to solve the multi-class problem. In
Chapter 4, the results are based on a twelve class dataset which contains a set of clean
images (one class) and steganography images (seven classes). The embedding methods
targeted in this paper are F5 (Westfeld, 2001; 2003), JP Hide (Latham, 1999), JSteg
(Upham, 1993), Model-based (Sallee, 2003; 2006), Model-based Version 1.2 (Sallee,
2008a), OutGuess (Provos, 2004), Steganos (2008), StegHide (Hetzl, 2003) and UTSA
(Agaian et al., 2006). The classification results are provided for the EM, k-nearest
neighbors, Parzen window and probabilistic neural network multi-class classifiers, the new
multi-class tree with KFD and SVM, the commercial tools and the fusion of all the multi-class
systems. The results also compare the classification of the embedding methods using the new
feature generation method with the wavelet-based features (Lyu and Farid,
2002) and the DCT-based features (Pevny and Fridrich, 2007). These results show four
techniques that improve classification accuracy: first, the new feature generation method;
second, the multi-class tree, which allows the KFD and SVM to be used as multi-class classifiers;
third, the selection of features at each node for the KFD and SVM classifiers; and
fourth, the fusion of the various classifiers. Finally, Chapter 5 provides a
conclusion, the contribution to the DoD and future directions that may be considered in
expanding the steganalysis multi-class system.

II. Literature Review


This  chapter  presents  related  work  relevant  to  the  development  of  a  steganalysis  system. 
There  are  several  sub-components  to  this  research,  including  JPEG  image  representation, 
feature  generation,  feature  preprocessing,  feature  extraction,  feature  selection, 
classification,  multi-class  classification  and  classifier  fusion.  Figure  2.1  shows  the  basic 
structure  of  the  detection  system  and  its  primary  components  discussed  in  this  chapter. 

Figure 2.1. Basic Detection System.

Related work on each of these topics is presented in this order in the following sections.

•  Image  Representation:  The  JPEG  image  format  is  described  along  with  a  basic 
description  of  the  areas  within  a  JPEG  image  that  are  manipulated  by  an 
embedding  method. 

•  Feature  Generation:  Using  statistical  measures  to  identify  changes  made  to  a 
JPEG  image  by  an  embedding  method,  two  transform  based  methods  in  this 
chapter  generate  one  dimensional  feature  vectors  from  a  matrix  image 
representation. 

•  Feature  Extraction:  The  methods  in  this  chapter  map  a  set  of  feature  vectors  to  a 
lower  dimensional  space. 

•  Feature  Ranking/Selection:  A  subset  of  features  is  chosen  according  to  feature 
ranking,  noise  features  and  class  separability  (means  and  variances). 

• Classification: Six classification methods are described, i.e., expectation
maximization, k-nearest neighbors, kernel Fisher’s discriminant, Parzen window,
probabilistic neural networks and support vector machines.

•  Multi-class  Classification:  The  multi-class  methods  include  true  multi-class 
classifiers  and  the  combination  of  two-class  classifiers. 

• Classifier Fusion: Three fusion methods are described: AdaBoost (Freund and
Schapire, 1995), Bayesian networks for model averaging (Murphy, 2001) and
probabilistic neural networks.

2.1 JPEG Image Representation Background

In  this  section,  the  basic  structure  of  the  JPEG  image  format  and  the  steps  in  the 
compression  process  are  described.  This  is  followed  by  a  brief  introduction  of  JPEG 
image  embedding  methods. 

The Joint Photographic Experts Group (JPEG) format uses lossy compression to achieve
high levels of compression on images with many colors (Elysium Ltd., 2004). JPEG is
an international standard for still image compression, and is widely used for
compressing grayscale and color images. JPEG images are commonly used for storing
digital photos and publishing Web graphics, tasks for which slight reductions in
image quality are barely noticeable. Due to the loss of quality during the compression
process, JPEGs should be used only where image file size is important (Murray and
vanRyper, 1994; Brown and Shepherd, 1995).

The JPEG encoder, shown in Figure 2.2, performs compression with the following
sequential steps: image preprocessing (dividing the input image into 8x8 blocks), forward
DCT of each 8x8 block, quantization with a scaling factor, separation of the DC and AC
coefficients, prediction of the DC coefficient, zig-zag scan of the AC coefficients, and
Huffman encoding (there are separate encoders for the DC and AC coefficients).


Figure  2.2.  Block  Diagrams  of  Sequential  JPEG  Encoder. 


In JPEG decoding, all steps from the encoding process are reversed. The following
procedure is a short description of the JPEG baseline system.

Preprocessing block - Subdivide the image into blocks of 8x8 pixels and level-shift the
original pixel values from the range [0, 255] to the range [-128, +127] by subtracting 128.
The shifting procedure is a preprocessing step for the DCT calculation.
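
The preprocessing step can be sketched as follows. This is a minimal illustration, assuming a grayscale image stored as a list of rows with 8-bit values and dimensions that are multiples of 8; the function name `level_shift_blocks` is chosen here for illustration and does not come from any JPEG library.

```python
def level_shift_blocks(image, block=8):
    """Split a grayscale image (rows of values 0..255) into level-shifted blocks."""
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("sketch assumes dimensions are multiples of the block size")
    blocks = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Subtract 128 so that values lie in [-128, 127].
            blocks.append([[image[top + r][left + c] - 128 for c in range(block)]
                           for r in range(block)])
    return blocks

# An 8x16 test image yields two 8x8 blocks, scanned left to right.
image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(8)]
blocks = level_shift_blocks(image)
```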

Forward DCT block - Perform a two-dimensional discrete cosine transform (DCT) on
each level-shifted block B from the Preprocessing block step. The two-dimensional DCT is
defined as

$$C(n_1, n_2, k_1, k_2) = \alpha_1(k_1)\,\alpha_2(k_2)\,\cos\!\left(\frac{\pi (2n_1 + 1) k_1}{2N_1}\right)\cos\!\left(\frac{\pi (2n_2 + 1) k_2}{2N_2}\right), \qquad \alpha_i(k_i) = \begin{cases}\sqrt{1/N_i}, & k_i = 0\\ \sqrt{2/N_i}, & 1 \le k_i \le N_i - 1\end{cases} \tag{2.1}$$

where $0 \le n_1 \le N_1 - 1$ and $0 \le n_2 \le N_2 - 1$. In the JPEG encoding process, $N_1 = N_2 = 8$.
The transform is performed on the two-dimensional matrix B as $CBC^{T}$, with C taken as the
8x8 matrix of one-dimensional DCT basis vectors.
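
As a rough illustration of the matrix form, the sketch below builds an orthonormal DCT matrix C and applies it to a block as C B C^T. The orthonormal scaling (square root of 1/N for the first row, square root of 2/N for the others) is an assumption of this sketch, and the helper names are hypothetical; plain Python lists are used so that no external library is required.

```python
import math

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT matrix C with C[k][m] = a(k) * cos(pi * (2m + 1) * k / (2n))."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                     for m in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(block):
    """Two-dimensional DCT of a square block, computed as C B C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

# A constant block has all of its energy in the upper-left (DC) coefficient.
flat = [[10] * N for _ in range(N)]
coeffs = dct2(flat)
```

For the constant block, only the upper-left coefficient is non-zero (up to rounding error), which is consistent with its interpretation as the block average scaled by the transform normalization.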

The transform helps to remove data redundancy by mapping data from the spatial domain
to the frequency domain. No compression is achieved at this stage, but
changing the representation of the information contained in the image block makes the data
more suitable for compression.

Quantization - Quantize the block of DCT coefficients obtained from the previous step
using the quantization table Q. The quantization table is a matrix by which the
transformed block is divided; this reduces the amplitude of the DCT
coefficient values and increases the number of zero-valued coefficients. The Huffman
encoder takes advantage of these quantized values. In the notation Qs, the value s is
a scalar multiplier, called the scale (or quality) factor, which defines the amount of
compression within the image. Higher values of s yield higher compression. Figure 2.3
shows an instance of Qs.

Figure 2.3. Typical quantization matrix.

A set of four quantization tables is specified by the JPEG standard (Independent JPEG
Group, 1998). After quantization, most of the DCT coefficients in the 8x8 blocks are
truncated to zero. This is the principal source of loss in the JPEG transform-based
encoder.
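
The effect of the scale factor can be sketched with a toy example. The 2x2 table and coefficient values below are hypothetical excerpts chosen for illustration, and rounding to the nearest integer is an assumption of this sketch rather than a statement about any particular encoder.

```python
def quantize(coeffs, table, s=1.0):
    """Divide each DCT coefficient by the scaled table entry and round to an integer."""
    return [[int(round(c / (q * s))) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]

def dequantize(levels, table, s=1.0):
    """Approximate inverse used when decoding; the rounding loss is not recoverable."""
    return [[v * q * s for v, q in zip(v_row, q_row)]
            for v_row, q_row in zip(levels, table)]

table = [[16, 11], [12, 12]]            # hypothetical 2x2 excerpt of a table Q
coeffs = [[80.0, -7.0], [5.0, 31.0]]    # hypothetical DCT coefficients
levels = quantize(coeffs, table, s=1.0)
coarse = quantize(coeffs, table, s=4.0)
```

With the larger scale factor, more coefficients collapse to zero and the surviving values have smaller magnitudes, which is what the entropy coder exploits.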

DC Coefficient Coding - The first coefficient, coefficient 1 (upper left) in Figure 2.4 b),
is called the “DC coefficient”, short for direct current coefficient, and represents the
average brightness (intensity) of the component block. To encode the DC coefficient, the
JPEG standard utilizes a Huffman difference code table that categorizes the value
according to the number of bits k required to represent its magnitude. The value
of the element is then encoded with k bits.
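
The categorization by bit count can be sketched as below. The sketch assumes that each DC value is coded as its difference from the DC value of the previous block, starting from a predictor of zero; the function names are illustrative only, and the Huffman table lookup itself is omitted.

```python
def size_category(value):
    """Number of bits k needed to represent |value| (0 for a zero value)."""
    return abs(value).bit_length()

def dc_categories(dc_values):
    """Differences between consecutive DC coefficients with their size categories."""
    previous = 0  # assumed starting value of the predictor
    out = []
    for dc in dc_values:
        diff = dc - previous
        out.append((diff, size_category(diff)))
        previous = dc
    return out

pairs = dc_categories([5, 7, 7, -1])
```

Each pair holds the difference and the category k that would select the Huffman code; the k additional bits carrying the actual value are not shown.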

AC Coefficients Coding - The remaining 63 coefficients are the “AC coefficients”, short
for alternating current coefficients. The Huffman code assigns short (binary)
codewords to each AC coefficient. The AC coefficient encoding scheme is slightly more
elaborate than the one for the DC coefficient. For each AC array, the run-length of zero
elements is recorded. When a non-zero element is encountered, the length of the run of 0s is recorded
and the number of bits k needed to represent the magnitude of the element is determined. The
run-length and k are used as a category in the JPEG default Huffman table for
assigning a code.
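
A minimal sketch of the run-length step follows. It produces the (run-length, k) category for each non-zero AC coefficient but leaves out the Huffman table lookup and any limit on the maximum run length; the function name is hypothetical.

```python
def run_length_pairs(ac):
    """Return (zero_run, k, value) triples for the non-zero AC coefficients."""
    triples, run = [], 0
    for value in ac:
        if value == 0:
            run += 1                      # extend the current run of zeros
        else:
            triples.append((run, abs(value).bit_length(), value))
            run = 0
    # The remaining count is the number of trailing zeros after the last non-zero value.
    return triples, run

triples, trailing = run_length_pairs([3, 0, 0, -2, 0, 1, 0, 0, 0])
```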

A zig-zag run encoder converts the 8x8 array of DCT coefficients into a column
vector of length k (zig-zag goes from left to right and top to bottom). The “zig-zag” scan
attempts to trace the DCT coefficients according to their significance, as shown in Figure 2.4.

Figure 2.4. DCT decomposition zig-zag

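
The scan order can be generated programmatically. The sketch below assumes the conventional JPEG zig-zag pattern, which walks the anti-diagonals of the block in alternating directions starting from the upper-left coefficient; the function names are chosen for illustration.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in zig-zag scan order."""
    order = []
    for d in range(2 * n - 1):               # d = row + col indexes an anti-diagonal
        cells = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the other way.
        order.extend(reversed(cells) if d % 2 == 0 else cells)
    return order

def zigzag_scan(block):
    """Flatten a square block into a list following the zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

order = zigzag_order()
```

Low-frequency positions near the upper-left corner come first, so after quantization the zero-valued high-frequency coefficients tend to cluster at the end of the vector.
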
The Huffman encoding reduces the number of bits needed to store each of the 64 integer
coefficients. For example, when a true color uncompressed image of size 512x512 pixels
is stored, the file size is 769 kilobytes. However, when this same image is stored as a JPEG at a
quality factor of 75, the file size is 200 kilobytes or smaller. The Huffman
encoding tables for the DC and AC coefficients can be found in Gonzalez and Woods
(1992, 2002, 2007), Elysium Ltd. (2004), Independent JPEG Group (1998), and JPEG
(1994).


One of the primary reasons for using image embedding methods to create stego files is
the number of redundant portions within a digital image. The vast number of JPEG
images on the Internet makes them ideal cover images for hiding secret data and
transmitting them as stego images. In JPEG steganography, the stego message is
converted to binary values and embedded into the DC and AC coefficients prior to Huffman
encoding. Because Huffman encoding is lossless, embedding at this stage allows the stego
message to be extracted without loss. The embedding methods range from simple embedding techniques that alter
the least significant bits (LSB) of the coefficients, such as JP Hide (Latham, 1999) and
JSteg (Upham, 1993), to more complicated embedding techniques that maintain natural
histograms of the coefficients, such as F5 (Westfeld, 2001; 2003), JP Hide (Latham,
1999), JSteg (Upham, 1993), Model-based (Sallee, 2003; 2006), Model-based Version 1.2
(Sallee, 2008a), OutGuess (Provos, 2004), Steganos (2008), StegHide (Hetzl, 2003) and
UTSA (Agaian et al., 2006). The six tools selected provide a set of embedding methods
that differ in embedding strategy. Investigation of these methods has provided an insight
into six different and unique embedding capacities, embedding patterns and the
appearance of the individual feature spaces. Another reason for selecting these particular
tools is that these 6 embedding methods have been used for analysis in previous research and
existing steganalysis tools (Provos and Honeyman, 2003; Lyu and Farid, 2004; Kharrazi
et al., 2005; Shi et al., 2005; Xuan et al., 2005; Fu et al., 2006; Pevny and Fridrich, 2007).
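
The simplest class of methods mentioned above, LSB replacement in the quantized coefficients, can be sketched as follows. This is a simplified illustration in the spirit of JSteg and is not the actual algorithm of any tool listed; the rule of skipping coefficients equal to 0 or 1 is an assumption of the sketch, and the function names are hypothetical.

```python
def embed_lsb(coeffs, bits):
    """Replace the LSB of usable coefficients with message bits (simplified sketch)."""
    out, queue = [], list(bits)
    for c in coeffs:
        if queue and c not in (0, 1):        # assumption: values 0 and 1 are skipped
            c = (c & ~1) | queue.pop(0)      # clear the LSB, then set it to the bit
        out.append(c)
    if queue:
        raise ValueError("message does not fit in the cover coefficients")
    return out

def extract_lsb(coeffs, n_bits):
    """Read n_bits back from the LSBs of the usable coefficients."""
    usable = [c & 1 for c in coeffs if c not in (0, 1)]
    return usable[:n_bits]

cover = [12, 0, -3, 1, 5, 2, 0, -1]    # hypothetical quantized coefficients
message = [1, 0, 0, 1, 1]
stego = embed_lsb(cover, message)
```

Because the changes are made after quantization and before lossless entropy coding, `extract_lsb(stego, len(message))` recovers the message exactly. Changes of this kind alter the coefficient histogram, which is the sort of statistical trace that steganalysis features are designed to capture.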

In  summary,  a  useful  property  of  JPEG  is  that  the  degree  of  lossiness  can  be  varied  by 
adjusting  the  quality  factor  s  (scale  of  the  quantization  table),  shown  in  Figure  2.3.  The 
ease of file sharing with JPEG images and their popularity on the Internet have made the JPEG
image format a desirable cover file for many stego methods. Each embedding method
leaves a signature that can be identified by various statistical measures. The next section
describes feature generation methods used to identify changes made to a JPEG image.

2.2  Feature  Generation  for  JPEG  Images 

Several steganalysis feature generation methods used to identify changes made to a JPEG
image have been published (Lie and Lin, 2005; Shi et al., 2005; Xuan et al., 2005; Fu et
al., 2006; Wang and Moulin, 2007). In this section two well-known methods are
discussed. The first method, developed by Lyu and Farid (2002; 2004), is a wavelet-based
method in which features are generated from the wavelet coefficients using various
statistics. The second method is a DCT-based feature generation method in which the
features are computed from functions of the difference between the DCT
coefficients of the input image and of the predicted image (Fridrich, 2004; Pevny and
Fridrich, 2006).

The JPEG image coefficients are extracted using a transform, i.e., the DCT or the wavelet
transform, where the wavelet transform is calculated over the spatial domain, not the transform
domain. These coefficients represent the image characteristics in a raw format, e.g., low,
mid and high frequencies for the DCT and vertical, horizontal and diagonal orientations for the
wavelet transform. The predictors, which are estimates of where the stego message is
hidden within an image, depend on the feature generation method. Lyu and Farid (2002,
2004) use a regression technique to develop the weights associated with the coefficients
to produce the predictors. Fridrich (2004) crops an input image and re-expands the image
to develop the predictors. The features are finally generated by calculating statistics from
the coefficients and the predictors.

2.2.1  Wavelet  Statistical  Model 


The  image  decomposition  employed  here  is  based  on  separable  quadrature  mirror  filters 
(Lyu  and  Farid,  2002,  2004).  In  digital  signal  processing,  a  quadrature  mirror  filter  is  a 
filter  bank  which  splits  an  input  signal  into  two  bands,  low-pass  and  high-pass 
frequencies.  The  low-pass  and  high-pass  filters  are  related  by  the  following  equation: 


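$$\left|H_0\!\left(e^{j\omega}\right)\right|^2 + \left|H_1\!\left(e^{j\omega}\right)\right|^2 = 1 \tag{2.2}$$

where $H_0$ and $H_1$ are the low-pass and high-pass filter responses, $\omega$ is the frequency, and the
sampling rate is normalized to $2\pi$, as shown in Figure 2.5.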
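
A small numerical check can make the filter-pair relationship concrete. The sketch below assumes the relation takes the common power-complementary form, in which the squared magnitudes of the low-pass and high-pass responses sum to one at every frequency, and uses a two-tap averaging and differencing pair chosen purely for illustration; these are not the filters used in the cited work.

```python
import cmath
import math

def frequency_response(taps, omega):
    """Evaluate H(e^{j omega}) = sum_n h[n] * e^{-j omega n} for a finite filter."""
    return sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps))

low_pass = [0.5, 0.5]      # two-tap averaging filter (illustrative choice)
high_pass = [0.5, -0.5]    # mirror filter: every other tap changes sign

omegas = [k * math.pi / 8 for k in range(9)]   # frequencies from 0 to pi
power_sums = [abs(frequency_response(low_pass, w)) ** 2
              + abs(frequency_response(high_pass, w)) ** 2 for w in omegas]
```

For this pair the low-pass power is cos^2(omega/2) and the high-pass power is sin^2(omega/2), so every entry of `power_sums` equals one up to rounding error.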


Orthogonal  wavelets  such  as  the  Haar  wavelets  and  related  Daubechies  wavelets  are 
generated  by  scaling  functions  which,  with  the  wavelet,  satisfy  a  quadrature  mirror  filter 
relationship  (Addison,  2002;  Gonzalez  and  Woods,  2004).  Farid  (2002)  uses  a  variety  of 
wavelets  but  in  this  related  work  the  symmetric  quadrature  mirror  filters  (Simoncelli  and 
Adelson,  1990)  are  used.  A  wavelet  is  a  mathematical  function  used  to  divide  a  given 
function  into  different  frequency  components  and  study  each  component  with  a 
resolution  that  matches  its  scale.  A  wavelet  transform  is  the  representation  of  a  function 
by  wavelets.  The  wavelets  are  scaled  and  translated  copies  (known  as  daughter  wavelets) 
of  a  finite-length  or  fast-decaying  oscillating  waveform  (known  as  the  mother  wavelet). 
Wavelet  transforms  have  advantages  over  traditional  Fourier  transforms  for  representing 
functions  that  have  discontinuities  and  sharp  peaks  (Gonzalez  and  Woods,  2002; 
Addison,  2002). 

The following explanation of the feature generation method is from Lyu and Farid (2002,
2004). The mapping from the spatial domain to the wavelet transform domain,
$f(x, y) \rightarrow V(x, y),\ H(x, y),\ D(x, y)$, is a decomposition that splits the frequency space into
multiple orientations and scales. For a grayscale image, the vertical, horizontal and
diagonal subbands at scale i are denoted as $V_i(x, y)$, $H_i(x, y)$, and $D_i(x, y)$, respectively. In
Figure 2.6b, i is equal to 1 for the first level wavelet decomposition and the second level
decomposition is represented in Figure 2.6c. For a color (RGB) image, the decomposition
is applied independently to each color channel. The resulting subbands are denoted as
$V_i^c(x, y)$, $H_i^c(x, y)$, and $D_i^c(x, y)$, where $c \in \{r, g, b\}$.


Figure 2.6. Wavelet Structure: a) simple image with vertical, horizontal and diagonal
lines, b) 2 level wavelet decomposition, c) 3 level wavelet decomposition.


Given the decomposed image, the statistical model is composed of the mean $\mu$, variance
$\sigma^2$, skewness $\gamma_1$ and kurtosis $\gamma_2$ of the subband coefficients at each orientation, scale and
color channel. In order to capture higher-order statistical correlations, a second set of
statistics is collected, based on the errors in a linear predictor of coefficient
magnitude. For the purpose of illustration, consider the vertical band of the green channel
at scale i, $V_i^g(x, y)$. A linear predictor for the magnitude of these coefficients in a subset
of all possible spatial, orientation, scale, and color neighbors is given by:

$$\begin{aligned}
|V_i^g(x, y)| = {} & w_1 |V_i^g(x-1, y)| + w_2 |V_i^g(x+1, y)| + w_3 |V_i^g(x, y-1)| \\
& + w_4 |V_i^g(x, y+1)| + w_5 |V_{i+1}^g(x/2, y/2)| + w_6 |D_i^g(x, y)| \\
& + w_7 |D_{i+1}^g(x/2, y/2)| + w_8 |V_i^r(x, y)| + w_9 |V_i^b(x, y)|
\end{aligned} \tag{2.3}$$


where $|\cdot|$ denotes absolute value and $w_k$ are the weights. This linear relationship can be
expressed more compactly in matrix form as:

$$\mathbf{v} = Q\mathbf{w} \tag{2.4}$$

where $\mathbf{v}$ contains the coefficient magnitudes of $V_i^g(x, y)$ strung out into a column vector
(to reduce sensitivity to noise, only magnitudes greater than 1 are considered), the
columns of the matrix Q contain the neighboring coefficient magnitudes as specified in
Equation (2.3), and $\mathbf{w} = (w_1 \ldots w_9)^T$. The weights $\mathbf{w}$ are determined by minimizing the
following quadratic error function:

$$E(\mathbf{w}) = [\mathbf{v} - Q\mathbf{w}]^2 \tag{2.5}$$

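Using regression techniques, the error function is minimized by differentiating with
respect to $\mathbf{w}$:

$$\frac{dE(\mathbf{w})}{d\mathbf{w}} = -2Q^T(\mathbf{v} - Q\mathbf{w}) \tag{2.6}$$

setting the result equal to zero, and solving for $\mathbf{w}$ to yield the following solution:

$$\mathbf{w} = (Q^T Q)^{-1} Q^T \mathbf{v} \tag{2.7}$$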


Given the large number of constraints (one per pixel) and only nine unknowns, it is generally
assumed that the 9x9 matrix $Q^T Q$ will be invertible.
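
The least-squares solution can be sketched in a few lines. The example below forms the normal equations and solves them by Gauss-Jordan elimination, using two unknowns instead of nine and synthetic data so that the expected weights are known; the function name and the data are assumptions made for this illustration.

```python
def solve_normal_equations(Q, v):
    """Least-squares weights w = (Q^T Q)^{-1} Q^T v via Gauss-Jordan elimination."""
    n = len(Q[0])
    # Build the augmented system [Q^T Q | Q^T v].
    a = [[sum(row[i] * row[j] for row in Q) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(Q, v))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("Q^T Q is singular for this data")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

# Synthetic example: rows of Q are neighbor magnitudes, v is generated from known weights.
Q = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
true_w = [0.5, 0.25]
v = [sum(q * w for q, w in zip(row, true_w)) for row in Q]
w = solve_normal_equations(Q, v)
```

With more rows than unknowns and linearly independent columns, the recovered weights match the ones used to generate v. In practice a library least-squares routine would normally be preferred over explicit normal equations for numerical stability.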

Given the linear predictor, the log error between the actual coefficient and the predicted
coefficient magnitudes is:

$$\mathbf{p} = \log(\mathbf{v}) - \log(|Q\mathbf{w}|) \tag{2.8}$$

where $\log(\cdot)$ is computed point-wise on each vector component. The $\log(\cdot)$ is used to
scale the values of the coefficients. Note that if data standardization is applied to the generated
features after the statistics are calculated, the $\log(\cdot)$ operation may be omitted. It is from
this error that additional statistics are collected, namely the mean, variance, skewness and
kurtosis. This process is repeated for scales $i = 1, \ldots, n$, and for the subbands $V_i^r$ and $V_i^b$,
where the linear predictors for these subbands are of the form:

$$\begin{aligned}
|V_i^r(x, y)| = {} & w_1 |V_i^r(x-1, y)| + w_2 |V_i^r(x+1, y)| + w_3 |V_i^r(x, y-1)| \\
& + w_4 |V_i^r(x, y+1)| + w_5 |V_{i+1}^r(x/2, y/2)| + w_6 |D_i^r(x, y)| \\
& + w_7 |D_{i+1}^r(x/2, y/2)| + w_8 |V_i^g(x, y)| + w_9 |V_i^b(x, y)|
\end{aligned} \tag{2.9}$$

and

$$\begin{aligned}
|V_i^b(x, y)| = {} & w_1 |V_i^b(x-1, y)| + w_2 |V_i^b(x+1, y)| + w_3 |V_i^b(x, y-1)| \\
& + w_4 |V_i^b(x, y+1)| + w_5 |V_{i+1}^b(x/2, y/2)| + w_6 |D_i^b(x, y)| \\
& + w_7 |D_{i+1}^b(x/2, y/2)| + w_8 |V_i^r(x, y)| + w_9 |V_i^g(x, y)|
\end{aligned} \tag{2.10}$$

A similar process is repeated for the horizontal and diagonal subbands. As an example,
the predictors for the green channel take the form:

$$\begin{aligned}
|H_i^g(x, y)| = {} & w_1 |H_i^g(x-1, y)| + w_2 |H_i^g(x+1, y)| + w_3 |H_i^g(x, y-1)| \\
& + w_4 |H_i^g(x, y+1)| + w_5 |H_{i+1}^g(x/2, y/2)| + w_6 |D_i^g(x, y)| \\
& + w_7 |D_{i+1}^g(x/2, y/2)| + w_8 |H_i^r(x, y)| + w_9 |H_i^b(x, y)|
\end{aligned} \tag{2.11}$$

and

$$\begin{aligned}
|D_i^g(x, y)| = {} & w_1 |D_i^g(x-1, y)| + w_2 |D_i^g(x+1, y)| + w_3 |D_i^g(x, y-1)| \\
& + w_4 |D_i^g(x, y+1)| + w_5 |D_{i+1}^g(x/2, y/2)| + w_6 |H_i^g(x, y)| \\
& + w_7 |V_i^g(x, y)| + w_8 |D_i^r(x, y)| + w_9 |D_i^b(x, y)|
\end{aligned} \tag{2.12}$$

For the horizontal and diagonal subbands, the predictors for the red and blue channels are determined in the same way as for the vertical subbands, Equations (2.9) and (2.10). For each orientation, scale and color subband, a similar error metric, Equation (2.11), and error statistics are computed.

For a multi-scale decomposition with scales $i = 1, \ldots, s$, the total number of basic coefficient statistics is $36(s - 1)$ ($12(s - 1)$ per color channel), and the total number of error statistics is also $36(s - 1)$, yielding a grand total of $72(s - 1)$ statistics. These statistics form the feature vectors used to discriminate between images with and without hidden messages. The set of 72 features representing an input image is used in Chapter 4 as a subset of the 526 features for the steganalysis detection system in this research.

2.2.2  DCT  Features 

In this method two types of features are calculated over an image: first order features and second order features. The following explanation of the features generated in the DCT and spatial domains follows Fridrich (2004). A vector functional $F$ is applied to the stego JPEG image $J_1$. The stego image $J_1$ is decompressed to the spatial domain, cropped by 4 pixels in each direction, and recompressed with the same quantization table used in decompressing $J_1$ to obtain $J_2$, as shown in Figure 2.7. The vector functional $F$ is then applied to $J_2$. The $L_1$ norm of a vector or matrix is defined as the sum of the absolute values of all its elements. The final feature $f$ is obtained as the $L_1$ norm of the difference in the vector functional between the original and modified image as follows:

$$f = \| F(J_1) - F(J_2) \|_{L_1} \qquad (2.13)$$


[Figure 2.7 diagram: JPEG file $J_1$, decompressed to the spatial domain $I(x,y)$, cropped by 4 pixels, recompressed to JPEG file $J_2$]

Figure 2.7. Feature generating structure.


First Order Features - The simplest first order statistic of DCT coefficients is their histogram. The JPEG image is represented by a DCT coefficient array $d_k(u,v)$ and a quantization matrix $Q(u,v)$, where $u,v = 1, \ldots, 8$ and $k = 1, \ldots, B$. The symbol $d_k(u,v)$ denotes the $(u,v)$th quantized DCT coefficient in the $k$th block, and there are a total of $B$ blocks. The global histogram of all $64B$ DCT coefficients (64 per block) is denoted as $H_r$, where $r = L, \ldots, R$, $L = \min_{k,u,v} d_k(u,v)$ and $R = \max_{k,u,v} d_k(u,v)$.

There are steganographic programs that preserve the histogram. Thus, individual histograms for low frequency DCT modes are added to the set of functionals. For a fixed DCT mode $(u,v)$, let $h_r^{uv}$, $r = L, \ldots, R$, denote the individual histogram of values $d_k(u,v)$, $k = 1, \ldots, B$. Only histograms of low frequency DCT coefficients are used because histograms of coefficients from medium and higher frequencies are usually statistically unimportant due to the small number of non-zero coefficients.

To provide additional first order macroscopic statistics to the set of functionals, dual histograms are included. For a fixed coefficient value $d$, the dual histogram is an 8×8 matrix $g_{uv}^d$:

$$g_{uv}^d = \sum_{k=1}^{B} \delta(d, d_k(u,v)) \qquad (2.14)$$

where $\delta(a,b) = 1$ if $a = b$ and 0 otherwise.
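As an illustration of the first order functionals, the following Python sketch computes the global histogram, an individual histogram and a dual histogram. It assumes, purely for this example, that the quantized DCT coefficients are stored as a list of 8×8 nested lists (one per block); the function names are hypothetical and not taken from Fridrich (2004).

```python
from collections import Counter


def global_histogram(blocks):
    """Count every quantized DCT coefficient value over all blocks (H_r)."""
    return Counter(c for block in blocks for row in block for c in row)


def individual_histogram(blocks, u, v):
    """Histogram of the single DCT mode (u, v), 1-based as in the text."""
    return Counter(block[u - 1][v - 1] for block in blocks)


def dual_histogram(blocks, d):
    """8x8 matrix g[u][v] (0-based): number of blocks whose mode equals d."""
    return [[sum(1 for block in blocks if block[u][v] == d)
             for v in range(8)]
            for u in range(8)]
```

Each functional would then be normalized by its $L_1$ norm before the calibrated difference of Equation (2.13) is taken.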


Second Order Features - Let $I_r$ and $I_c$ denote the vectors of block indices while scanning the image "by rows" and "by columns", respectively. The first functional capturing inter-block dependency is the "variation" $V$ defined as

$$V = \frac{\sum_{u,v=1}^{8} \sum_{k=1}^{|I_r|-1} \left| d_{I_r(k)}(u,v) - d_{I_r(k+1)}(u,v) \right| + \sum_{u,v=1}^{8} \sum_{k=1}^{|I_c|-1} \left| d_{I_c(k)}(u,v) - d_{I_c(k+1)}(u,v) \right|}{|I_r| + |I_c|} \qquad (2.15)$$

Most steganographic techniques in some sense add entropy to the array of quantized DCT coefficients and thus are more likely to increase the variation $V$ than to decrease it.
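The variation of Equation (2.15) can be sketched in Python as follows. The layout is an assumption made for this example: `grid[r][c]` holds the 8×8 block of quantized DCT coefficients at block row `r` and block column `c`, so the row scan and column scan play the roles of $I_r$ and $I_c$.

```python
def variation(grid):
    """Variation V of Eq. (2.15) for a 2-D grid of 8x8 DCT blocks."""
    by_rows = [blk for row in grid for blk in row]                 # I_r
    by_cols = [grid[r][c]
               for c in range(len(grid[0]))
               for r in range(len(grid))]                           # I_c

    def scan_sum(order):
        # Sum of absolute differences between consecutive blocks, all modes.
        return sum(abs(a[u][v] - b[u][v])
                   for a, b in zip(order, order[1:])
                   for u in range(8) for v in range(8))

    return (scan_sum(by_rows) + scan_sum(by_cols)) / (len(by_rows) + len(by_cols))
```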

Embedding changes also increase the discontinuities along the 8×8 block boundaries. Two blockiness measures $B_\alpha$, $\alpha = 1, 2$, are included in the set of functionals. The blockiness is calculated from the decompressed JPEG image (spatial domain) and thus represents an integral measure of inter-block dependency over all DCT modes over the whole image:

$$B_\alpha = \frac{\sum_{x=1}^{\lfloor (M-1)/8 \rfloor} \sum_{y=1}^{N} \left| I(8x,y) - I(8x+1,y) \right|^\alpha + \sum_{y=1}^{\lfloor (N-1)/8 \rfloor} \sum_{x=1}^{M} \left| I(x,8y) - I(x,8y+1) \right|^\alpha}{N \lfloor (M-1)/8 \rfloor + M \lfloor (N-1)/8 \rfloor} \qquad (2.16)$$

In the expression above, $M$ and $N$ are the image dimensions and $I(x,y)$ are the grayscale values of the decompressed JPEG image.
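A minimal Python sketch of the blockiness measure in Equation (2.16) is given below. It assumes the decompressed image is a nested list `img[x][y]` of grayscale values with 0-based indices, so the boundary between the 1-based positions $8x$ and $8x+1$ becomes the pair of indices `8*i - 1` and `8*i`. It also assumes the image is larger than one block in at least one direction; otherwise the denominator is zero.

```python
def blockiness(img, alpha):
    """Blockiness B_alpha of Eq. (2.16); img[x][y] are grayscale values."""
    M, N = len(img), len(img[0])
    bx, by = (M - 1) // 8, (N - 1) // 8   # number of internal block boundaries
    total = 0.0
    for i in range(1, bx + 1):            # boundaries along the first axis
        for y in range(N):
            total += abs(img[8 * i - 1][y] - img[8 * i][y]) ** alpha
    for j in range(1, by + 1):            # boundaries along the second axis
        for x in range(M):
            total += abs(img[x][8 * j - 1] - img[x][8 * j]) ** alpha
    return total / (N * bx + M * by)
```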


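The final three functionals are calculated from the co-occurrence matrix of neighboring DCT coefficients. Recalling the notation, $L \le d_k(u,v) \le R$, the co-occurrence matrix $C$ is a square $D \times D$ matrix, $D = R - L + 1$, defined as follows

$$C_{st} = \frac{\sum_{k=1}^{|I_r|-1} \sum_{u,v=1}^{8} \delta\left(s, d_{I_r(k)}(u,v)\right) \delta\left(t, d_{I_r(k+1)}(u,v)\right) + \sum_{k=1}^{|I_c|-1} \sum_{u,v=1}^{8} \delta\left(s, d_{I_c(k)}(u,v)\right) \delta\left(t, d_{I_c(k+1)}(u,v)\right)}{|I_r| + |I_c|} \qquad (2.17)$$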

The co-occurrence matrix describes the probability distribution of pairs of neighboring DCT coefficients. It usually has a sharp peak at (0,0) and then quickly falls off. Let $C(J_1)$ and $C(J_2)$ be the co-occurrence matrices for the JPEG image $J_1$ and its calibrated version $J_2$, respectively. Due to the approximate symmetry of $C_{st}$ around $(s,t) = (0,0)$, the differences $C_{st}(J_1) - C_{st}(J_2)$ for $(s,t) \in \{(0,1), (1,0), (-1,0), (0,-1)\}$ are strongly correlated. The same is true for the group $(s,t) \in \{(1,1), (-1,1), (1,-1), (-1,-1)\}$. For practically all steganographic schemes, the embedding changes perturb the DCT coefficients by some small value. Thus, the co-occurrence matrix for the embedded image can be obtained as a convolution $C * P(q)$, where $P$ is the probability distribution of the embedding distortion, which depends on the relative message length $q$. This means that the values of the co-occurrence matrix $C * P(q)$ will be more "spread out". To quantify this spreading, the following three quantities are taken as features:

$$N_{00} = C_{0,0}(J_1) - C_{0,0}(J_2)$$

$$N_{01} = C_{0,1}(J_1) - C_{0,1}(J_2) + C_{1,0}(J_1) - C_{1,0}(J_2) + C_{-1,0}(J_1) - C_{-1,0}(J_2) + C_{0,-1}(J_1) - C_{0,-1}(J_2)$$

$$N_{11} = C_{1,1}(J_1) - C_{1,1}(J_2) + C_{1,-1}(J_1) - C_{1,-1}(J_2) + C_{-1,1}(J_1) - C_{-1,1}(J_2) + C_{-1,-1}(J_1) - C_{-1,-1}(J_2)$$
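The co-occurrence matrix of Equation (2.17) and the three derived features can be sketched in Python as shown below. As in the other sketches, `grid[r][c]` is assumed to be an 8×8 block of quantized DCT coefficients; the matrix is stored sparsely as a dictionary keyed by $(s,t)$, which is a convenience of this example rather than part of the method.

```python
from collections import Counter


def cooccurrence(grid):
    """Sparse co-occurrence matrix C[(s, t)] of Eq. (2.17)."""
    by_rows = [blk for row in grid for blk in row]
    by_cols = [grid[r][c] for c in range(len(grid[0])) for r in range(len(grid))]
    counts = Counter()
    for order in (by_rows, by_cols):
        for a, b in zip(order, order[1:]):        # neighboring blocks in the scan
            for u in range(8):
                for v in range(8):
                    counts[(a[u][v], b[u][v])] += 1
    norm = len(by_rows) + len(by_cols)
    return {st: n / norm for st, n in counts.items()}


def cooccurrence_features(c1, c2):
    """N00, N01, N11 from the matrices of J1 (c1) and calibrated J2 (c2)."""
    def diff(pairs):
        return sum(c1.get(p, 0.0) - c2.get(p, 0.0) for p in pairs)
    n00 = diff([(0, 0)])
    n01 = diff([(0, 1), (1, 0), (-1, 0), (0, -1)])
    n11 = diff([(1, 1), (1, -1), (-1, 1), (-1, -1)])
    return n00, n01, n11
```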


The final set of 20 vector functionals used in this method is summarized in Table 2.1. Three additional features are listed at the bottom of Table 2.1.


Table 2.1. All 23 distinguishing functionals.

Functional/Feature Name                          Functional F(·)
Global Histogram                                 H / ||H||_{L1}
Individual Histograms for 5 DCT Modes            h^{21}/||h^{21}||_{L1}, h^{31}/||h^{31}||_{L1}, h^{12}/||h^{12}||_{L1}, h^{22}/||h^{22}||_{L1}, h^{13}/||h^{13}||_{L1}
Dual Histograms for 11 DCT Values (-5,...,5)     g^{-5}/||g^{-5}||_{L1}, g^{-4}/||g^{-4}||_{L1}, ..., g^{4}/||g^{4}||_{L1}, g^{5}/||g^{5}||_{L1}
Variation                                        V
L1 and L2 Blockiness                             B_1, B_2
Co-occurrence                                    N_00, N_01, N_11 (features, not functionals)

The features in Table 2.1 are extended from 23 to 193 by analyzing DCT coefficients in the range of -5 to 5 (Pevny and Fridrich, 2007). By applying the cropping technique in Figure 2.7 with a Markov process, an additional 81 features are created for a total of 274 features (Pevny and Fridrich, 2007). The set of 274 features representing an input image is used in Chapter 4 as a subset of the 526 features for the steganalysis detection system in this research.

2.3 Feature Preprocessing


After features are generated, it is necessary to preprocess the features that are to be used for classification. In many practical situations the classification model may receive input features whose values lie within different dynamic ranges. Thus, features with large values may inadvertently influence classification more than features with small values. Another problem arises when a particular sample does not lie in the same region as the other samples. To resolve these issues, the feature preprocessing methods used in this research are data normalization (Theodoridis and Koutroumbas, 2006, pp. 214-215), data standardization (Dillon and Goldstein, 1984, pp. 12-13) and outlier removal (Barnett and Lewis, 1994). The training vectors in this section are represented by $x = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^n$, with a dimension of $n$ and the number of samples defined as $l$.


2.3.1  Data  Preparation 

Data preparation scales the features so that they have similar magnitudes. Some of the procedures used for data preparation are feature standardization (Dillon and Goldstein, 1984, pp. 12-13), feature min-max normalization (Theodoridis and Koutroumbas, 2006, pp. 214-215), min-max global normalization (Guyon et al., 2006, p. 254), sigmoid normalization (Theodoridis and Koutroumbas, 2006, pp. 214-215) and softmax scaling (Theodoridis and Koutroumbas, 2006, pp. 214-215). We use zero-mean normalization (feature standardization) and min-max normalization (feature normalization) and describe them in more detail.

Min-max normalization performs a linear scaling on the original data. The normalization is calculated from estimates of the minimum and maximum of the values. The normalization technique is defined for the $l$ available data samples and the $k$th feature as:

$$\hat{x}_{ik} = \frac{x_{ik} - \min(x_k)}{\max(x_k) - \min(x_k)} (b - a) + a, \quad k = 1, 2, \ldots, n \qquad (2.18)$$

where $a$ and $b$ are scaling factors. When $a = 0$ and $b = 1$ the individual feature values are in the range of [0,1]. In the event that the denominator of Equation (2.18) is equal to zero, that feature is removed, avoiding the normalization of a constant feature.

Z-score normalization (standardization) is based on the mean and standard deviation of each feature. Each feature in this method is separately standardized by subtracting its mean and dividing by the standard deviation as follows:

$$\hat{x}_{ik} = \frac{x_{ik} - \mu_k}{\sigma_k}, \quad k = 1, 2, \ldots, n \qquad (2.20)$$

where $\mu_k$ and $\sigma_k$ are defined as:

$$\mu_k = \frac{1}{l} \sum_{i=1}^{l} x_{ik} \qquad (2.21)$$

$$\sigma_k^2 = \frac{1}{l-1} \sum_{i=1}^{l} (x_{ik} - \mu_k)^2 \qquad (2.22)$$

and $l$ is the number of samples. In the event that the standard deviation of a particular feature is zero (e.g., each element of the observed feature is a constant value), the feature is discarded.
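The two scalings described above can be sketched in Python as follows. The sketch assumes the data is a list of samples, each a list of $n$ feature values; the function names are illustrative. Constant features are dropped in both cases, mirroring the zero-denominator rule stated for Equation (2.18) and the zero standard deviation rule stated for Equation (2.20).

```python
import math


def min_max_normalize(samples, a=0.0, b=1.0):
    """Scale each feature (column) to [a, b]; constant features are dropped."""
    kept = []
    for col in zip(*samples):
        lo, hi = min(col), max(col)
        if hi == lo:                      # zero denominator in Eq. (2.18)
            continue
        kept.append([(x - lo) / (hi - lo) * (b - a) + a for x in col])
    return [list(row) for row in zip(*kept)]


def standardize(samples):
    """Z-score each feature; features with zero deviation are dropped."""
    l = len(samples)
    kept = []
    for col in zip(*samples):
        mu = sum(col) / l
        sigma = math.sqrt(sum((x - mu) ** 2 for x in col) / (l - 1))
        if sigma == 0.0:
            continue
        kept.append([(x - mu) / sigma for x in col])
    return [list(row) for row in zip(*kept)]
```

In practice the minimum, maximum, mean and standard deviation would be estimated on the training set and reused on test samples; the sketch omits that bookkeeping for brevity.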
2.3.2 Outlier Removal


An outlier is defined as a sample that is inconsistent with the existing sample distribution. The inconsistency is defined by the analyst observing the input data. Outliers can be discarded if their number is small in comparison with the remaining samples, e.g., one or two samples. Various guides are provided by Barnett and Lewis (1994) to determine a small number of outliers. When a large number of outliers exist, care must be taken by the analyst. In this case, the classification model may have to be trained to accommodate the presence of outliers, e.g., expectation maximization can be trained using ellipsoids. Two outlier removal techniques for multivariate outliers are used in this section. The first uses the mean to identify the upper and lower boundaries of a confidence interval; a sample outside the interval is identified as an outlier and removed (Barnett and Lewis, 1994). The second is a multivariate outlier technique presented by Wilks (1963).

Confidence Interval Outlier Removal - In confidence interval outlier removal, any sample outside of the confidence interval is considered an outlier. This method assumes the data is normally distributed and generates a confidence interval for each feature. The first step identifies upper and lower means relative to the global mean as follows:

$$\mu_{\mathrm{upper}} = \frac{1}{S_{\mathrm{upper}}} \sum_{i \in S_{\mathrm{upper}}} x_i, \quad \text{for } x_i > \mu \qquad (2.23)$$

$$\mu_{\mathrm{lower}} = \frac{1}{S_{\mathrm{lower}}} \sum_{i \in S_{\mathrm{lower}}} x_i, \quad \text{for } x_i < \mu \qquad (2.24)$$

where $\mu$ is the global mean vector

$$\mu = \frac{1}{l} \sum_{i=1}^{l} x_i \qquad (2.25)$$

where $l$ is the number of samples, and $S_{\mathrm{upper}}$ and $S_{\mathrm{lower}}$ are the numbers of samples meeting the criteria $x_i > \mu$ and $x_i < \mu$, respectively. The terms $i \in S_{\mathrm{lower}}$ and $i \in S_{\mathrm{upper}}$ indicate the indices for which the criteria $x_i < \mu$ and $x_i > \mu$ are met. This leads to the confidence interval defined as:

$$\mu_{\mathrm{lower}} - (\alpha (\mu - \mu_{\mathrm{lower}})), \quad \mu_{\mathrm{upper}} + (\alpha (\mu_{\mathrm{upper}} - \mu)) \qquad (2.26)$$

where $\alpha$ is a parameter set by the user. A good starting point is $\alpha = 0.5$, with the parameter then adjusted based on the data set being analyzed. The terms multiplied by $\alpha$ in Equation (2.26) can be replaced by the critical value of the $t$-distribution as described by Barnett and Lewis (1994, page 74), providing robustness of validity for the confidence interval. An alternative modification to Equation (2.26) is to replace $(\mu - \mu_{\mathrm{lower}})$ by the standard deviation of $\mu_{\mathrm{lower}}$ and $(\mu_{\mathrm{upper}} - \mu)$ by the standard deviation of $\mu_{\mathrm{upper}}$, allowing the standard deviation to determine the confidence interval.
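A minimal Python sketch of the confidence interval technique of Equations (2.23) through (2.26) follows. It assumes the interval is built independently for each feature and that a sample is removed when any of its features falls outside the corresponding interval; that removal rule and the function names are assumptions of this example. Each feature must have values on both sides of its mean.

```python
def confidence_interval(values, alpha=0.5):
    """Lower and upper bounds of Eq. (2.26) for one feature."""
    mu = sum(values) / len(values)                 # global mean, Eq. (2.25)
    upper = [x for x in values if x > mu]
    lower = [x for x in values if x < mu]
    mu_upper = sum(upper) / len(upper)             # Eq. (2.23)
    mu_lower = sum(lower) / len(lower)             # Eq. (2.24)
    return (mu_lower - alpha * (mu - mu_lower),
            mu_upper + alpha * (mu_upper - mu))


def remove_outliers(samples, alpha=0.5):
    """Keep only samples whose every feature lies inside its interval."""
    bounds = [confidence_interval(col, alpha) for col in zip(*samples)]
    return [row for row in samples
            if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))]
```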

Wilks' Outlier Removal - Wilks' outlier removal technique uses an upper bound for detection of a single outlier from a set of normal multivariate samples in which the maximum squared Mahalanobis distance (Equation (2.27)) approaches an $F$ distribution (Wilks, 1963).

$$D_i^2 = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \qquad (2.27)$$

where $\mu$ is the global mean vector and $\Sigma$ is the covariance matrix of the samples. In multivariate outlier detection the normality between samples is assessed. A partial mathematical description is provided by Rencher (2002, pp. 101-104) and expanded in application by Trujillo-Ortiz, et al. (2008).

The threshold is determined by the $F$ distribution critical value $F$ (inverse of the $F$ cumulative distribution function) with $n$ and $(l - n - 1)$ degrees of freedom using the Bonferroni correction (Bonferroni, 1935; 1936). The final critical value is defined by:

$$D_{\mathrm{crit}}^2 = \frac{n (l - 1)^2 F}{l (l - n - 1) + (l n F)} \qquad (2.28)$$

The index of an outlier(s) is identified by the following criterion:

$$D_i^2 > D_{\mathrm{crit}}^2 \qquad (2.29)$$

This method is provided in full detail by Trujillo-Ortiz, et al. (2008).

2.4  Feature  Extraction 

Feature extraction maps the input samples, $x$, from the input feature space $x \in \mathbb{R}^n$ to a new feature space $z \in \mathbb{R}^p$, where $n > p$ and $p$ features are extracted. In this case, the classification is based on the samples in the new feature space, $z$, rather than on the input feature space. The advantage of feature extraction over feature selection is that no information from any of the elements of the input feature is lost. In certain situations feature extraction may be easier to calculate than feature selection. In this section two feature extraction methods are discussed: principal component analysis (PCA), where the new feature space is $z \in \mathbb{R}^m$, and kernel PCA, where the feature space is $z \in \mathbb{R}^p$.

2.4.1 Principal Component Analysis (PCA)


The idea of feature extraction using PCA (Hotelling, 1933) is to represent the data in a new space in which the extracted features are mutually uncorrelated. The new features are known as the principal components after the transform mapping. The dimensionality assessment is accomplished by extracting the principal components from the correlation matrix and retaining only the factors that satisfy Kaiser's criterion (eigenvalues $\lambda > 1$) (Kaiser, 1960). The criterion is used as a guideline to determine the number of principal components to retain by calculating the correlation matrix of the input features. Each observed variable contributes one unit of variance to the total variance in the data set. Hence, any principal component that has an eigenvalue, $\lambda$, greater than one accounts for a greater amount of variance than had been contributed by one variable. Conversely, a principal component with an eigenvalue less than one accounts for less variance than had been contributed by one variable. The covariance matrix, $\Sigma$, is used to extract eigenvectors, $e$, retaining only the number of principal components corresponding to Kaiser's criterion.

The basic concept of feature extraction using PCA is to map $x$ onto a new space capable of reducing the dimensionality of the input space. The data is partitioned by variance using a linear combination of 'original' factors. To perform PCA, let $x = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^n$ be a set of training vectors from the $n$-dimensional input space $\mathbb{R}^n$. The set of vectors $z = [z_1, z_2, \ldots, z_l] \in \mathbb{R}^m$ is a lower dimensional representation of the input training vectors $x$ in the $m$-dimensional space $\mathbb{R}^m$. The vectors $z$ are obtained by the linear orthonormal projection

$$z = A^T (x - \mu) \qquad (2.30)$$

where $A$ is an $[n \times m]$ matrix containing the top $m$ eigenvectors and $\mu$ is the mean of each feature of $x$.

2.4.2  Kernel  PCA 

The Kernel Principal Component Analysis (Kernel PCA) is the non-linear extension of the ordinary linear PCA (Scholkopf et al., 1998). The input training vectors $x = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^n$ are mapped by a nonlinear transformation $\phi(\cdot): X \rightarrow F$ to a new feature space $F$. The mapping $\phi(\cdot)$ is represented in the kernel PCA by a kernel function $K(\cdot,\cdot)$ which defines an inner product in $F$. This yields a non-linear (kernel) projection of the data, which has the general definition

$$z = A^T K(x_i, x_j) + b \qquad (2.31)$$

where $A$ is an $[l \times p]$ matrix containing the top $p$ eigenvectors, $b$ is a bias vector and $z \in \mathbb{R}^p$ is the vector of extracted features. The eigenvectors are not computed directly from the kernel matrix. The kernel matrix must first be centered as follows:

$$K_c = K(x_i, x_j) - 1_{[l \times l]} K(x_i, x_j) - K(x_i, x_j) 1_{[l \times l]} + 1_{[l \times l]} K(x_i, x_j) 1_{[l \times l]} \qquad (2.32)$$

where $1_{[l \times l]}$ is an $[l \times l]$ matrix in which every value is $1/l$. The eigenvalues, $\lambda$, and eigenvectors, $e$, are determined with the use of $K_c$. The bias vector $b$ is computed as:

$$b = A^T \left( 1_{[l \times l]} K(x_i, x_j) 1_l - K(x_i, x_j) 1_l \right) \qquad (2.33)$$

where $1_l$ is an $[l \times 1]$ vector with each element equal to $1/l$.
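The kernel centering step of Equation (2.32) can be illustrated with a short Python sketch. It assumes the kernel matrix is a square nested list and uses the fact that multiplying by the matrix of $1/l$ entries amounts to taking row, column and overall means; the function name is hypothetical.

```python
def center_kernel(K):
    """Return K_c = K - 1K - K1 + 1K1, with 1 the l-by-l matrix of 1/l."""
    l = len(K)
    row_mean = [sum(row) / l for row in K]                       # entries of K 1
    col_mean = [sum(K[i][j] for i in range(l)) / l
                for j in range(l)]                               # entries of 1 K
    total_mean = sum(row_mean) / l                               # entries of 1 K 1
    return [[K[i][j] - col_mean[j] - row_mean[i] + total_mean
             for j in range(l)]
            for i in range(l)]
```

For a linear kernel, the result equals the kernel matrix of the mean-centered samples, which gives a convenient sanity check.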


2.5  Feature  Ranking/Selection 

When a decision problem has an extremely large number of features, a classification algorithm often has difficulty identifying the best features to use for classification. For this reason, one step in the classification process is the identification of features that retain as much class discriminatory information as possible. This procedure is known as feature ranking/selection or reduction. A first step in feature ranking/selection is to look at each feature independently and test its discriminatory capability for the problem. Although looking at the features independently is far from optimal, this procedure helps to discard features that do not separate the classes. In this section, the five ranking methods used in this research are described: Bhattacharyya distance, Fisher's discriminant ratio, signal to noise ratio, kernel Fisher's discriminant feature ranking and zero-norm feature ranking. The selection of vital features for each of these methods is determined by the user based on either a ranking value threshold or the classification accuracy of a selected subset of top ranked features.

2.5.1  Bhattacharyya  Distance 

The Bhattacharyya distance is used as a class separability measure. For two-class normal distributions the Bhattacharyya distance is defined as:

$$B = \frac{1}{8} (\mu_{-1} - \mu_{+1})^T \left( \frac{\Sigma_{-1} + \Sigma_{+1}}{2} \right)^{-1} (\mu_{-1} - \mu_{+1}) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_{-1} + \Sigma_{+1}}{2} \right|}{\sqrt{|\Sigma_{-1}| |\Sigma_{+1}|}} \qquad (2.34)$$

where $|\cdot|$ denotes the determinant of the respective matrix. The Bhattacharyya distance corresponds to the optimum Chernoff bound when $\Sigma_{-1} = \Sigma_{+1}$. It is readily seen that in this case the Bhattacharyya distance becomes proportional to the Mahalanobis distance between the means. It should be noted that the Bhattacharyya distance consists of two terms. The first term gives the class separability due to the mean difference and disappears when $\mu_{-1} = \mu_{+1}$. The second term gives the class separability due to the covariance difference and disappears when $\Sigma_{-1} = \Sigma_{+1}$ (Fukunaga, 1990).


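The Bhattacharyya distance for the multi-class case is represented as:

$$B_{ij} = \frac{1}{8} (\mu_i - \mu_j)^2 \left( \frac{\sigma_i^2 + \sigma_j^2}{2} \right)^{-1} + \frac{1}{2} \ln \left( \frac{\sigma_i^2 + \sigma_j^2}{2 \sigma_i \sigma_j} \right), \quad i \neq j \qquad (2.35)$$

where $i, j \in \mathbb{Z}$ in this case correspond to the classes $C = C_j = [C_1, C_2, \ldots, C_c]$, $j = 1, 2, \ldots, c$. In this case, for each feature, an individual class is compared to the remaining classes based on distance. The features are assigned a ranking value according to the greatest distance between classes.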
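For a single feature, the two-class distance of Equation (2.34) reduces to a scalar expression in the class means and variances. The following Python sketch evaluates that scalar form and, as one possible reading of the ranking rule above, takes the largest pairwise distance over all class pairs as the ranking value. The per-class `(mean, variance)` input format and the function names are assumptions of this example.

```python
import math


def bhattacharyya_1d(mu_i, var_i, mu_j, var_j):
    """One-dimensional Bhattacharyya distance between two normal classes."""
    mean_term = 0.25 * (mu_i - mu_j) ** 2 / (var_i + var_j)
    cov_term = 0.5 * math.log((var_i + var_j) / (2.0 * math.sqrt(var_i * var_j)))
    return mean_term + cov_term


def rank_feature(class_stats):
    """Largest pairwise distance; class_stats is [(mean, variance), ...]."""
    return max(bhattacharyya_1d(*class_stats[i], *class_stats[j])
               for i in range(len(class_stats))
               for j in range(i + 1, len(class_stats)))
```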

2.5.2  Fisher’s  Linear  Discriminant  Ratio  (FDR/F-Score) 

The FDR is used to quantify the separability capabilities of individual features (Fisher, 1936). FDR is a simple technique which measures the discrimination of sets of real numbers. The within-class scatter matrix is defined as

$$S_w = \sum_{C} P_C S_C \qquad (2.36)$$

where $S_C$ is the covariance matrix for class $C \in \{-1, +1\}$

$$S_C = \sum_{i=1,\, i \in C}^{l_C} (x_i - \mu_C)(x_i - \mu_C)^T \qquad (2.37)$$

and $P_C$ is the a priori probability of class $C$. That is, $P_C \approx l_C / l$, where $l_C$ is the number of samples in class $C$, out of a total of $l$ samples. The between-class scatter matrix is defined as

$$S_b = \sum_{C} P_C (\mu_C - \mu)(\mu_C - \mu)^T \qquad (2.38)$$

where $\mu$ is the global mean vector

$$\mu = \frac{1}{l} \sum_{i=1}^{l} x_i \qquad (2.39)$$

and the class mean vector $\mu_C$ is defined as

$$\mu_C = \frac{1}{l_C} \sum_{i=1,\, i \in C}^{l_C} x_i \qquad (2.40)$$

These criteria take a special form in the one-dimensional, two-class problem. In this case, it is easy to see that for equiprobable classes $|S_w|$ is proportional to $\sigma_{-1}^2 + \sigma_{+1}^2$ and $|S_b|$ is proportional to $(\mu_{-1} - \mu_{+1})^2$. Combining $S_b$ and $S_w$, Fisher's discriminant ratio results in the following equation

$$FDR = \frac{(\mu_{-1} - \mu_{+1})^2}{\sigma_{-1}^2 + \sigma_{+1}^2} \qquad (2.41)$$




FDR is sometimes used to quantify the separability capabilities of individual features. For the multi-class case, averaging forms of FDR can be used. One possibility is

$$FDR = \sum_{i}^{M} \sum_{j \neq i}^{M} \frac{(\mu_i - \mu_j)^2}{\sigma_i^2 + \sigma_j^2} \qquad (2.42)$$

where the subscripts $i, j$ refer to the mean and variance corresponding to the feature under investigation for the classes $C_i$, $C_j$, respectively.

For  the  one-dimensional  multi-class  case,  the  Fisher’s  discriminant  ratio  is  modified  as: 


FDR, 


2  2 

of+a; 


(2.43) 
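As a sketch, the one-dimensional FDR of Equation (2.41) and the averaging form of Equation (2.42) can be computed in Python from per-class means and variances. The `(mean, variance)` tuple format and the function names are assumptions of this example; the sum runs over all ordered class pairs with $i \neq j$, as written in Equation (2.42).

```python
def fdr(mu_a, var_a, mu_b, var_b):
    """Two-class, one-dimensional Fisher's discriminant ratio, Eq. (2.41)."""
    return (mu_a - mu_b) ** 2 / (var_a + var_b)


def fdr_multiclass(class_stats):
    """Averaging form of Eq. (2.42); class_stats is [(mean, variance), ...]."""
    return sum(fdr(*class_stats[i], *class_stats[j])
               for i in range(len(class_stats))
               for j in range(len(class_stats))
               if i != j)
```

A feature with well separated class means and small class variances receives a large value and is therefore ranked highly.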


2.5.3  Signal-to-Noise  Feature  Selection 

One method for neural network feature selection uses a signal-to-noise ratio (SNR) saliency measure (Bauer et al., 2000). This measure directly compares the saliency of a feature to that of an injected noise feature. The SNR saliency measure is computed using the following:

$$SNR_i = 10 \log_{10} \left( \frac{\sum_{j=1}^{J} (w_{i,j}^1)^2}{\sum_{j=1}^{J} (w_{N,j}^1)^2} \right) \qquad (2.44)$$

where $SNR_i$ is the value of the SNR saliency measure for feature $i$, $J$ is the number of hidden nodes, $w_{i,j}^1$ is the first layer weight from node $i$ to node $j$, and $w_{N,j}^1$ is the first layer weight from the injected noise node $N$ to node $j$. The weights $w_{N,j}^1$ for the noise feature are initialized and updated in the same fashion as the weights $w_{i,j}^1$ emanating from the other features in the first layer. The injected noise feature is created such that its distribution follows that of a Uniform(0,1) random variable. The SNR screening method potentially requires only a single training run, because the SNR saliency measure appears highly robust relative to the effects of weight initialization. For the probabilistic neural network classification method described in Chapter 3, this method is used to determine the appropriate subset of features.
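Given the trained first layer weights, the saliency measure of Equation (2.44) is a ratio of summed squared weights expressed in decibels. The Python sketch below assumes the weights are supplied as plain lists, one list of $J$ weights per feature plus one list for the injected noise node; the function name and input layout are hypothetical.

```python
import math


def snr_saliency(first_layer_weights, noise_weights):
    """SNR saliency (in dB) of each feature relative to the noise feature."""
    noise_energy = sum(w * w for w in noise_weights)
    return [10.0 * math.log10(sum(w * w for w in feature_w) / noise_energy)
            for feature_w in first_layer_weights]
```

Features whose saliency is close to or below 0 dB are no more influential than the injected noise and are candidates for removal.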

2.5.4  Kernel  Fisher’s  Recursive  Feature  Elimination 

The SVM-RFE (Guyon et al., 2002) discussed in Section 2.1 is extended to the kernel Fisher's discriminant (KFD) for feature ranking. The method in this subsection starts with all $n$ available features, and performs KFD on the kernel space alpha vectors $\alpha$ (Louw and Steel, 2006). The feature ranking value for the kernel Fisher's recursive feature elimination (KF-RFE) is calculated as

$$R_m = \frac{\alpha^T M^{(m)} \alpha}{\alpha^T N^{(m)} \alpha} \qquad (2.45)$$


where 


M(m) 

mS° 


M W 


i,Xj 


i’XJ 


) 

) 


(2.46) 
and


N(m)  = 


(2.47) 


where $C = \{C_{-1}, C_{+1}\} = \{-1, +1\}$. The KF-RFE algorithm consists of the following steps:

1. Calculate the alpha values as:

$\alpha = (M_{-1} - M_{+1}) / N$

2. For the number of input features $n$, initialize the feature dimensionality as $\tilde{n} = n$ and perform steps 3 through 6 $n$ times.

3. For the number of input features $\tilde{n}$, perform steps 3 through 5 $\tilde{n}$ times.

4. Assign ranking values $R_m$ by calculating Equation (2.45), removing one feature at a time at location $m$.

5. Sort the ranking values $R_m$, remove the highest ranked feature, store the index of the removed feature and assign the new dimension as $\tilde{n} \leftarrow \tilde{n} - 1$.

2.5.5  Zero-Norm  Feature  Ranking 

Weston et al. (2003) proposed a zero-norm feature ranking method capable of identifying features that are close to linear separation. This method was extended to the nonlinear case by using support vector machines with kernels capable of separating non-linear features. The nonlinear feature selection method calculates ranking values ($R_m$) as follows:

$$R_m = \sum_{k,j} \alpha_k \alpha_j y_k y_j \left( K(x_k, x_j) \cdot K^{(m)}(x_k, x_j) \right) \qquad (2.48)$$

where $(\cdot)$ in this method is a point by point multiplication of the two kernel matrices. The zero-norm feature ranking algorithm consists of the following steps:

1. Initialize the weights, $w$, to ones.

2. Weight $x$ by the weights $w$, $x \leftarrow x \cdot w$.

3. Using the selected SVM model, identify the alpha values, $\alpha$, and support vectors, $x_k$.

4. Calculate Equation (2.48).

5. Calculate the new weights $w \leftarrow w \cdot (\max(R_m) - R_m)^T$.

6. Sort the weights $w$, identify the weights that are less than a set threshold, remove the features corresponding to the identified weights, and store the index of the ranked feature.

7. Repeat steps 2 through 6 until the maximum number of iterations is met or all of the features have been ranked.

The threshold used in step 6 is set to the maximum $w$ divided by 10 (Weston et al., 2006). The remaining weights $w$ for the nonlinear case should be normalized between zero and one to avoid an unnecessary feature increase in step 2. The maximum number of iterations, 20 (Weston et al., 2006), in step 7 avoids calculating $n$ SVM models in step 3.

2.6  Classification 

Machine learning for a classification task involves training over a set of samples $x = [x_1, x_2, \ldots, x_l]^T \in \mathbb{R}^n$. Each sample in the training set has one target value from $C = C_j = [C_1, C_2, \ldots, C_c]$, $j = 1, 2, \ldots, c$ (known as the class labels $y_i \in C$, $i = 1, 2, \ldots, l$), which describes the class of which the sample is a member. The objective is to separate the data into their classes such that the degree of association is strong between members of the same class and weak between members of different classes. From the class separation, an unseen sample $x_0 \in \mathbb{R}^n$ can then be appropriately classified. In this section six classification methods are presented: expectation maximization with mixture models (EM), $k$-nearest neighbors ($k$-NN), kernel Fisher's discriminant (KFD), Parzen window, probabilistic neural networks (PNN) and support vector machines (SVM).

2.6.1 Expectation Maximization (EM)


The idea behind the EM algorithm (Dempster et al., 1977) is that even though the data values of $x$, feature vectors $x \in \mathbb{R}^n$, are unknown/incomplete, the distribution $f(x \mid p)$ can be used to determine an estimate for the maximum likelihood (Tomasi, 2006). In maximum likelihood estimation, the estimate to be modeled is the parameter(s) for which the observed data are the most likely. This is done by iteratively estimating the data parameters, then using the data to update the estimated parameters, until a desired convergence is met. The two major steps of the EM algorithm are the expectation step (E-Step) and the maximization step (M-Step).

The EM algorithm consists of choosing initial parameters for the means, $\mu_k^{(0)}$, standard deviations, $\sigma_k^{(0)}$, and mixing probabilities, $p^{(0)}(k \mid l)$, for a user defined number of clusters, $k$, then performing the E-Step and M-Step successively until convergence, where $i$ is the current iteration and $n$ is the number of samples. Convergence is determined by examining when the parameters stop changing, i.e., when $|\mu_k^{(i)} - \mu_k^{(i+1)}| < \varepsilon$, $|\sigma_k^{(i)} - \sigma_k^{(i+1)}| < \varepsilon$ and $|p^{(i)}(k \mid l) - p^{(i+1)}(k \mid l)| < \varepsilon$ for some epsilon ($\varepsilon$) and distance calculation (Euclidean distance). Maximum likelihood estimation is a method of estimating the parameters of the distributions based upon the observed data.

The expectation step (E-Step) calculates the membership probabilities, $p^{(i)}(k \mid l)$ (Tomasi, 2006). The mixing probabilities $p_k$ are viewed as the sample mean of the membership probabilities $p^{(i)}(k \mid l)$, assuming a uniform distribution over all the data points. The Gaussian function, $g(x; \mu_k^{(i)}, \sigma_k^{(i)})$, is used to compute the mixture of Gaussian functions shown in the denominator of $p^{(i)}(k \mid l)$.

(2.49)

\[
p^{(i)}(k \mid l) = \frac{p_k^{(i)}\, g(x_l; \mu_k^{(i)}, \sigma_k^{(i)})}{\sum_{m} p_m^{(i)}\, g(x_l; \mu_m^{(i)}, \sigma_m^{(i)})} \tag{2.50}
\]


The maximization step (M-Step) uses the data from the expectation step as if it were measured data to determine the maximum likelihood estimate of the parameters (Tomasi, 2006). This estimated data is often referred to as the "imputed" data. This step depends on the membership probabilities $p^{(i)}(k \mid l)$, which are computed in the E-Step. The EM algorithm consists of iterating the mean, standard deviation, and mixing probabilities until convergence. The mixing probabilities are the sample mean of the conditional probabilities $p^{(i)}(k \mid l)$, assuming a uniform distribution over all the data points.


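\[
\mu_k^{(i+1)} = \frac{\sum_{l=1}^{n} p^{(i)}(k \mid l)\, x_l}{\sum_{l=1}^{n} p^{(i)}(k \mid l)} \tag{2.51}
\]

\[
\sigma_k^{(i+1)} = \sqrt{\frac{1}{D}\, \frac{\sum_{l=1}^{n} p^{(i)}(k \mid l)\, \|x_l - \mu_k^{(i+1)}\|^2}{\sum_{l=1}^{n} p^{(i)}(k \mid l)}} \tag{2.52}
\]

\[
p_k^{(i+1)} = \frac{1}{n} \sum_{l=1}^{n} p^{(i)}(k \mid l) \tag{2.53}
\]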
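The E-Step and M-Step updates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this research: the Gaussian is taken to be isotropic (one standard deviation per cluster, following Tomasi, 2006), and the names `gaussian`, `em_fit`, `eps` and `max_iter` are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """Isotropic Gaussian density g(x; mu, sigma) in D dimensions (assumed form)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    norm = (math.sqrt(2.0 * math.pi) * sigma) ** d
    return math.exp(-0.5 * sq_dist / sigma ** 2) / norm

def em_fit(samples, means, sigmas, mix, eps=1e-6, max_iter=200):
    """Alternate E-Step and M-Step until the parameters stop changing."""
    n, d, k_count = len(samples), len(samples[0]), len(means)
    for _ in range(max_iter):
        # E-Step: membership probabilities p(k | l).
        member = []
        for x in samples:
            weighted = [mix[k] * gaussian(x, means[k], sigmas[k])
                        for k in range(k_count)]
            total = sum(weighted)
            member.append([w / total for w in weighted])
        # M-Step: re-estimate mean, standard deviation and mixing probability.
        new_means, new_sigmas, new_mix = [], [], []
        for k in range(k_count):
            resp = sum(member[l][k] for l in range(n))
            mu = [sum(member[l][k] * samples[l][j] for l in range(n)) / resp
                  for j in range(d)]
            var = sum(member[l][k] * sum((a - b) ** 2 for a, b in zip(samples[l], mu))
                      for l in range(n)) / (d * resp)
            new_means.append(mu)
            new_sigmas.append(max(math.sqrt(var), 1e-9))  # avoid a zero spread
            new_mix.append(resp / n)
        # Convergence test: largest parameter change below eps.
        change = max(
            max(math.dist(a, b) for a, b in zip(means, new_means)),
            max(abs(a - b) for a, b in zip(sigmas, new_sigmas)),
            max(abs(a - b) for a, b in zip(mix, new_mix)),
        )
        means, sigmas, mix = new_means, new_sigmas, new_mix
        if change < eps:
            break
    return means, sigmas, mix
```

Called with two well-separated one-dimensional clusters and rough initial guesses, the returned means settle near the cluster centers and the mixing probabilities sum to one.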
2.6.1.1 Mixture Models

In mixture models, also known as model-based Gaussian clustering, the multivariate Gaussian normal is used as a density function, similar to that described in Equation (2.50). The general multivariate normal density for $n$ dimensions is

\[
g(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{n/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right) \tag{2.54}
\]


The geometric characteristics (size, shape and orientation) of the clusters are determined by the covariance matrix $\Sigma_k$, which is expressed in terms of its eigenvalue decomposition as described in Martinez and Martinez (2002). The decomposition of the covariance matrix $\Sigma_k$ is used as a suitable model for the geometric characteristics of the cluster. The structure of the covariance matrix is as follows:

\[
\Sigma_k = \lambda_k D_k A_k D_k^T \tag{2.55}
\]

where $\lambda_k$ is a scalar, $D_k$ is the orthogonal matrix of eigenvectors and $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$. Note that in EM the values $\mu_k$, $p_k$ and $\sigma_k$ are updated after each iteration, and in the mixture models $\sigma_k$ is replaced by $\Sigma_k$ to represent the geometric characteristics of the clusters.

The eigenvalue decomposition can be modeled as various clustering arrangements. Celeux and Govaert (1995) describe in detail fourteen models based on the eigenvalue decomposition, allowing for variations in the orientation, volume, shape and size of the clusters; six of these models are shown in Table 2.2 (Martinez and Martinez, 2002).
Table 2.2. Parameterization for mixture models.

Model   $\Sigma_k$                     Geometric Shape   Volume     Shape      Orientation
1       $\lambda I$                    Spherical         Equal      Equal      NA
2       $\lambda_k I$                  Spherical         Variable   Equal      NA
3       $\lambda D A D^T$              Ellipsoid         Equal      Equal      Equal
4       $\lambda_k D_k A_k D_k^T$      Ellipsoid         Variable   Variable   Variable
5       $\lambda D_k A D_k^T$          Ellipsoid         Equal      Equal      Variable
6       $\lambda_k D_k A D_k^T$        Ellipsoid         Variable   Equal      Variable


The eigenvalue decomposition can be modeled as various clustering arrangements, i.e., spheres, ellipsoids and rotations of ellipsoids. Allowing the orientation, volume, shape and size of the clusters to vary defines the various models used. Figure 2.8 shows the mixture model using rotated ellipsoids (Model 4) to generate the decision boundary around each class.


Figure  2.8.  Expectation  Maximization  using  mixture  models  with  Decision  Boundary. 
2.6.1.2 Bayes Classifier

The EM algorithm can be used to find a class label for an input sample. Classification uses input samples described by feature vectors $x_0 \in \mathbb{R}^n$ to assign the samples to a given class $C_j \in C = [C_1, C_2, \ldots, C_c]$, $j = 1,2,\ldots,c$. The Bayes classifier extends a general multivariate normal case where the covariance matrix $\Sigma_j$ for each class is different. For the multi-class classifier each class must have an individual conditional probability density, where the densities are modeled as normal distributions. The classes $C_j$ are defined as normal distributions centered about the mean vector $\mu_j$. The mean vector, $\mu_j$, and the covariance matrix, $\Sigma_j$, are calculated using the EM algorithm. The vector $x_0$ is an $n$-dimensional vector of the observed data, and $|\Sigma_j|$ and $\Sigma_j^{-1}$ are the determinant and inverse of the covariance matrix of the given class. The posterior probability of class membership can be calculated by Bayes rule if $C_j$ is defined as the event of belonging to population $j$. Using the density function $g(x; \mu_k^{(i)}, \sigma_k^{(i)})$ (Tomasi, 2006), the Bayes classifier can be expressed in terms of the prior probabilities, $P(C_j)$, and the posterior probability of class membership as follows:

\[
P(C_j \mid x_0) = \frac{P(C_j)\, g(x_0; \mu_j, \Sigma_j)}{\sum_{i=1}^{c} P(C_i)\, g(x_0; \mu_i, \Sigma_i)} \tag{2.56}
\]

where the a priori probabilities $P(C_j)$ are the estimates of belonging to a class, under the assumption that $\Sigma_j = \Sigma$ for all $j$.
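As an illustration of the Bayes rule above, the following sketch computes the posterior for each class and picks the largest. It is a simplification under stated assumptions: each class density is an isotropic Gaussian (a single spread per class rather than a full covariance matrix), and the function names are hypothetical.

```python
import math

def gaussian(x, mu, sigma):
    """Isotropic Gaussian density (a simplification of the full-covariance case)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    return math.exp(-0.5 * sq_dist / sigma ** 2) / (math.sqrt(2.0 * math.pi) * sigma) ** d

def bayes_posteriors(x0, means, sigmas, priors):
    """Posterior P(C_j | x0): prior times class density, normalised over classes."""
    joint = [p * gaussian(x0, m, s) for m, s, p in zip(means, sigmas, priors)]
    total = sum(joint)
    return [j / total for j in joint]

def bayes_classify(x0, means, sigmas, priors):
    """Return the index of the class with the largest posterior."""
    post = bayes_posteriors(x0, means, sigmas, priors)
    return max(range(len(post)), key=post.__getitem__)
```

With two classes centered at 0 and 5, equal spreads and equal priors, a sample at the midpoint 2.5 receives a posterior of 0.5 for each class, and samples on either side are assigned to the nearer class.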
2.6.2 k-Nearest Neighbors


k-Nearest Neighbors (k-NN), Figure 2.9, is a lazy learning approach that compares new samples with all of the samples in the training set, looking for the k nearest (Cover and Hart, 1967; Duda et al., 2001; Bishop, 2006).


Figure  2.9.  k-NN  Decision  Boundary. 


Let the vectors $x = [x_1, x_2, \ldots, x_\ell]^T \in \mathbb{R}^n$ and class labels $y_i \in C = [C_1, C_2, \ldots, C_c]$, $c \in \mathbb{Z}$, $i = 1,2,\ldots,\ell$, be a set of training vectors. Given an unknown feature vector $x_0$ and a distance measure, the algorithm for the k-nearest neighbor rule is as follows (Theodoridis and Koutroumbas, 2006):

•  Out of the $\ell$ training vectors $x$, identify the k nearest neighbors, irrespective of class label. k is chosen to be odd for a two-class problem, and in general not to be a multiple of the number of classes.

•  Out of the k samples, identify the number of vectors, $k_j$, that belong to class $C_j$, where $\sum_j k_j = k$.

•  Assign $x_0$ to the class $C_j$ with the maximum number $k_j$ of samples.

The distance measures used from the feature vector $x_0$ to each of its k nearest neighbors include the Euclidean and Mahalanobis distances. The advantage of k-nearest neighbor is the simplicity of the assignment procedure. The disadvantage of the method lies in the necessity to store all samples and compare each with an unknown sample (Fukunaga, 1990).
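The three steps of the k-nearest neighbor rule can be sketched as follows. This is a minimal illustration that assumes Euclidean distance; the function name `knn_classify` and the tie-breaking choice (smallest label) are assumptions, not part of the cited algorithm.

```python
import math
from collections import Counter

def knn_classify(x0, train_x, train_y, k=3):
    """Assign x0 to the class holding most of its k nearest training vectors."""
    # Step 1: rank all training vectors by Euclidean distance to x0.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x0, train_x[i]))
    # Step 2: count how many of the k nearest belong to each class (the k_j).
    votes = Counter(train_y[i] for i in order[:k])
    # Step 3: pick the class with the largest count; ties go to the smallest label.
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)
```

Every call scans the full training set, which reflects the storage and comparison cost noted above.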

2.6.3  Kernel  Fisher’s  Discriminant  (KFD) 

The kernel Fisher discriminant is the non-linear extension of Fisher's linear discriminant (FLD) (Jaakola and Haussler, 1998; Mika et al., 1999; Scholkopf and Smola, 2002). In the linear case, Fisher's discriminant is computed by finding the coefficients $w$ that maximize the following equation

\[
J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{2.57}
\]


To use the Fisher's discriminant for nonlinearly separable data, Mika et al. (1999) map the input feature space with the use of a kernel. The input space is represented by a training set $x_i$ of vectors with a feature dimensionality of $n$. The corresponding class labels are represented as $y_i \in C$, where $C = [C_{-1}, C_{+1}] = [-1, +1]$, $i = 1,2,\ldots,\ell$ and $\ell$ is the training set size. The basic idea is to first map the input features from the input space to the kernel space via a kernel function and then perform linear FLD. The aim is to find a direction $w = \sum_i \alpha_i \phi(x_i)$ from the feature space to the kernel space given by alpha vectors $\alpha = [\alpha_1, \ldots, \alpha_\ell]^T$ (Mika et al., 1999). Using the definitions of $S_B$ and $S_W$, the Fisher's linear discriminant in the mapped feature space can be defined as

\[
J(\alpha) = \frac{\alpha^T M \alpha}{\alpha^T N \alpha} \tag{2.58}
\]

where $M = (M_{-1} - M_{+1})(M_{-1} - M_{+1})^T$ is an $\ell \times \ell$ matrix,

\[
(M_{-1})_i = \frac{1}{\ell_{-1}} \sum_{\substack{j=1 \\ j \in C_{-1}}}^{\ell_{-1}} K(x_i, x_j), \qquad
(M_{+1})_i = \frac{1}{\ell_{+1}} \sum_{\substack{j=1 \\ j \in C_{+1}}}^{\ell_{+1}} K(x_i, x_j) \tag{2.59}
\]


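and

\[
N = \sum_{c \in C} K_c \left(I - 1_{\ell_c}\right) K_c^T \tag{2.60}
\]

where $C = \{C_{-1}, C_{+1}\} = \{-1, +1\}$, $K_c$ is the $\ell \times \ell_c$ kernel matrix for class $c$, $I$ is the identity matrix and $1_{\ell_c}$ is the matrix with all entries equal to $1/\ell_c$ (Mika et al., 1999).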


Mika et al. (1999) discuss numerical issues and regularization regarding the calculation of (2.60). These are resolved by simply adding a multiple of the identity matrix to $N$, defined as:

\[
N_\mu = N + \mu I. \tag{2.61}
\]


The next step is to use the alpha vectors and the kernel matrix to project the $n$-dimensional input feature space into a one-dimensional space as follows:

\[
\tilde{x} = K(x_i, x_j)\,\alpha \tag{2.62}
\]
The projection in (2.62) now becomes the space that is to be solved using an optimization solution to maximize the margin of separation between classes, as shown in Figure 2.10. In Mika et al. (1999), the Matlab Optimization Toolbox (2004) is used to solve the optimization problem (Scholkopf and Smola, 2002) with the projected space calculated in (2.62). In this research the one-dimensional SMO (Franc and Hlavac, 2007) is used as the optimization solution. This results in the non-negative alpha vectors $\alpha_i = (\alpha_1, \ldots, \alpha_\ell)$ with an upper bound $C$, $C \geq \alpha_i \geq 0$. The support vectors for the KFD trained model are $x_k = x_i$ and the decision function of the KFD classifier is written as $\mathrm{sign}(f(x))$, where $f(x)$ is defined by:

\[
f(x) = w \cdot \phi(x) + b = \sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) + b. \tag{2.63}
\]


This is equivalent to the maximal margin hyperplane in the input space defined by the kernel (Cristianini and Shawe-Taylor, 2000). The goal of the KFD is to solve for $\alpha$ and the bias $b$. To compute the bias $b$, Equation (2.63) is rewritten over the $\ell_s$ support vectors $x_k$ as follows:

\[
\sum_{k=1}^{\ell_s} \alpha_k y_k K(x_k, x_i) + b = y_i. \tag{2.64}
\]


Therefore, the bias is calculated by obtaining the average as (Scholkopf and Smola, 2002):

\[
b = \frac{1}{\ell_s} \sum_{i=1}^{\ell_s} \left( y_i - \sum_{k=1}^{\ell_s} \alpha_k y_k K(x_k, x_i) \right) \tag{2.65}
\]

In order to reduce the number of false positives and false negatives, the optimal bias in (2.65) can be adjusted accordingly. In this case the bias is a threshold (Scholkopf and Smola, 2002).
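For orientation, the sketch below computes a kernel Fisher direction in closed form. It does not reproduce the SMO-based procedure used in this research. Instead it assumes the regularized solution reported by Mika et al. (1999), in which the alpha vector is proportional to $N_\mu^{-1}(M_{+1} - M_{-1})$, with $N$ built from the per-class kernel blocks. The RBF kernel, the function names and the parameters `mu` and `sigma` are illustrative assumptions.

```python
import math

def rbf(a, b, sigma=1.0):
    """Radial basis kernel between two feature vectors."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / (2.0 * sigma ** 2))

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))) / aug[r][r]
    return z

def kfd_fit(xs, ys, mu=1e-3, sigma=1.0):
    """Return alpha = (N + mu I)^{-1} (M_plus - M_minus) for labels in {-1, +1}."""
    n = len(xs)
    kmat = [[rbf(xs[i], xs[j], sigma) for j in range(n)] for i in range(n)]
    n_mat = [[0.0] * n for _ in range(n)]
    class_means = {}
    for c in (-1, +1):
        idx = [j for j in range(n) if ys[j] == c]
        m_c = [sum(kmat[i][j] for j in idx) / len(idx) for i in range(n)]
        class_means[c] = m_c
        for i in range(n):
            for r in range(n):
                # K_c K_c^T - l_c m_c m_c^T equals K_c (I - 1_{l_c}) K_c^T.
                n_mat[i][r] += (sum(kmat[i][j] * kmat[r][j] for j in idx)
                                - len(idx) * m_c[i] * m_c[r])
    for i in range(n):
        n_mat[i][i] += mu  # regularization: N_mu = N + mu I
    diff = [p - q for p, q in zip(class_means[+1], class_means[-1])]
    return solve(n_mat, diff)

def kfd_project(x, xs, alpha, sigma=1.0):
    """One-dimensional projection of x onto the discriminant direction."""
    return sum(a * rbf(xi, x, sigma) for a, xi in zip(alpha, xs))
```

With this sign convention the mean projection of the +1 class lies above that of the -1 class, so a threshold (the bias) between the two projected class means separates them.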


Figure  2.10.  KFD  Decision  Boundary  using  RBF  Kernel. 


2.6.4  Parzen  Window 

Parzen estimation is a refinement of histogramming (Parzen, 1962; Fukunaga, 1990; Duda et al., 2001; Bishop, 2006). The basic idea behind Parzen window estimation is that the knowledge gained from each training sample $x$ of the input space, $\mathbb{R}^n$, is represented by a function centered at $x$ in the feature space. The functions themselves are represented with the use of a distance measure or a kernel estimator. The final class estimation is derived by summing the results from the kernel functions of each training sample:


(2.66)
For example, the Parzen window density model is optimized by maximizing the likelihood of the training data with the use of a Gaussian window surrounding each input data point. The Gaussian window can be represented with the use of a kernel function as an interpolation function which defines an inner product between the individual training samples. The Radial Basis kernel function uses a window width parameter, $\sigma$, which is also known as the spread of the function:

\[
K(x, x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) \tag{2.67}
\]


This results in a sum of small multivariate Gaussian probability distributions centered at each training sample $x$; an example is shown in Figure 2.11. As the density of the training samples and their respective Gaussian distributions increases, the estimate of the probabilities approaches the true probability density function (PDF) of the training samples. The estimation for classification for a data cluster is then based on a threshold set for the combined posterior probability from all samples. The classification decision assigns the samples to the class with maximal posterior probability according to the inequality:

\[
\frac{1}{\ell_j} \sum_{x_i \in C_j} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right) > \frac{1}{\ell_h} \sum_{x_i \in C_h} \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right), \quad \forall\, h \neq j \tag{2.68}
\]

where $\ell_j$ is the number of training samples in class $C_j$.


This method requires a reasonably large training data set and is computationally inexpensive during training but computationally expensive during testing. During testing the kernel function must be computed for each of the training samples, making a comparison between the new sample $x_0$ and all of the existing training samples $x$. Several kernel approaches have been proposed in the literature (Fukunaga, 1990; Wand and Jones, 1995). The kernels were originally presented by Parzen (1962).
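A Parzen window classifier with a Gaussian (RBF) window can be sketched as below. The sketch assumes equal class priors and averages the window responses per class; the names `rbf_window` and `parzen_classify` are hypothetical.

```python
import math

def rbf_window(x, xi, sigma):
    """Gaussian window centred on the training sample xi."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * sigma ** 2))

def parzen_classify(x0, train_x, train_y, sigma=1.0):
    """Assign x0 to the class with the largest average window response."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        sums[yi] = sums.get(yi, 0.0) + rbf_window(x0, xi, sigma)
        counts[yi] = counts.get(yi, 0) + 1
    scores = {c: sums[c] / counts[c] for c in sums}
    # Ties are resolved in favour of the smallest class label.
    return max(sorted(scores), key=scores.get)
```

There is no training step beyond storing the samples, while each prediction evaluates the window at every stored sample, which matches the cost profile described above.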
Figure 2.11. Parzen Density Estimator with RBF window with Decision Boundary.

2.6.5  Probabilistic  Neural  Networks  (PNN) 

The classification framework of the probabilistic neural network is shown in Figure 2.12 (Specht, 1998; 1990). There are a few decisions that have to be made regarding training of the neural network. First, the number of training samples and number of classes are selected for the pattern layer; this defines the structure of the network. For example, the set of input training samples is represented as $x = [x_1, x_2, \ldots, x_\ell]^T \in \mathbb{R}^n$ with class labels $y_i \in C = [C_1, C_2, \ldots, C_c]$, $i = 1,2,\ldots,\ell$. This will result in $c$ groups with each group in the pattern layer containing $\ell$ neurons. Second, for the summation layer the smoothing parameter, $\sigma$, in the nonlinear operation $f(z_i)$ of the neural network must be determined. As a general guideline the value of the smoothing parameter, $\sigma$, should be chosen as a function of the dimension of the problem, $n$, and the number of training samples, $\ell$ (Specht, 1990). The structure of the probabilistic neural network classifier has three layers, as shown in Figure 2.12: the pattern layer, the summation layer and the decision layer. The pattern layer forms a dot product of the input features, $x$, with the weight vectors, $w_i$, resulting in $z_i = x \cdot w_i$. A nonlinear operation $f(z_i)$ on $z_i$ is performed prior to outputting the activation to the summation layer.

\[
f(z_i) = \exp\left(\frac{z_i - 1}{\sigma^2}\right) \tag{2.69}
\]


The summation layer sums the inputs from the pattern layer that correspond to the class from which the training patterns were selected. The output layer returns the summation values for each of the $c$ classes; a two-class example is shown in Figure 2.13. Each output value $P_1, \ldots, P_c$ is the posterior probability that the sample belongs to that particular class, where $\sum_{j=1}^{c} P_j = 1$.
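The three layers can be illustrated with a short sketch. It assumes, following Specht (1990), that the input and the stored weight vectors are normalized to unit length so that the dot product is at most one, and it uses each training sample as a weight vector. The function names and the default `sigma` are hypothetical.

```python
import math

def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def pnn_posteriors(x0, train_x, train_y, classes, sigma=0.3):
    """Pattern, summation and output layers of a probabilistic neural network."""
    x = unit(x0)
    sums = {c: 0.0 for c in classes}
    for w, y in zip(train_x, train_y):
        z = sum(a * b for a, b in zip(x, unit(w)))    # pattern layer: z_i = x . w_i
        sums[y] += math.exp((z - 1.0) / sigma ** 2)   # activation, summed per class
    total = sum(sums.values())
    return {c: s / total for c, s in sums.items()}    # outputs sum to one
```

The decision layer then simply selects the class with the largest returned value.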
Figure 2.13. PNN with Decision Boundary.


2.6.6  Support  Vector  Machine  (SVM) 

SVM performs pattern recognition for two-class problems by determining the separating hyperplane that maximizes the distance between the closest points of each class in the training set (Scholkopf et al., 1998; 1999; 2002; Burgers, 1998; Vapnik, 1998; Platt, 2000; Hsu et al., 2006). These closest points are called support vectors. In finding the hyperplane, the SVM performs a nonlinear separation in the input space by using a nonlinear transformation $\phi(x_i)$ that maps the data points $x_i$ of the input space, $\mathbb{R}^n$, into a potentially higher dimensional space, called kernel space $\mathbb{R}^\ell$ ($\ell > n$). The mapping $\phi(x_i)$ is represented in the SVM classifier by a kernel function $K(x_i, x_j)$ that defines an inner product in $\mathbb{R}^\ell$.

The optimal hyperplane is the one with the maximal distance (in space $\mathbb{R}^\ell$) to the closest points $\phi(x_i)$ of the training data; an example is shown in Figure 2.14. Determining the hyperplane requires maximizing the following function with respect to $\alpha$

\[
W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{2.70}
\]

under the constraints $\sum_{i=1}^{\ell} \alpha_i y_i = 0$, $i = 1, \ldots, \ell$. The non-negative Lagrangian multipliers are $\alpha_k = (\alpha_1, \ldots, \alpha_\ell)$ with an upper bound $C$, $C \geq \alpha_i \geq 0$. The Lagrangian multipliers are also known as the alpha vectors.

With the given support vectors $x_k$ and class labels $y_k$, $k = 1, \ldots, \ell_s$, the decision function of the SVM classifier can be written as $\mathrm{sign}(f(x))$, where $f(x)$ is defined by:

\[
f(x) = w \cdot \phi(x) + b = \sum_{k=1}^{\ell_s} \alpha_k y_k K(x_k, x) + b \tag{2.71}
\]

This is equivalent to the maximal margin hyperplane in the input space defined by the kernel (Cristianini and Shawe-Taylor, 2000). The goal of the SVM is to solve for $\alpha$, the bias $b$ and the support vectors $x_k$. To compute the bias $b$, Equation (2.71) is rewritten as follows:

\[
\sum_{k=1}^{\ell_s} \alpha_k y_k K(x_k, x_i) + b = y_i \tag{2.72}
\]


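Therefore, the bias is calculated by obtaining the average as (Scholkopf and Smola, 2002):

\[
b = \frac{1}{\ell_s} \sum_{i=1}^{\ell_s} \left( y_i - \sum_{k=1}^{\ell_s} \alpha_k y_k K(x_k, x_i) \right) \tag{2.73}
\]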




In  order  to  reduce  the  number  of  false  positives  and  false  negatives  the  optimal  bias  in 
(2.73)  can  be  adjusted  accordingly.  In  this  case  the  bias  is  a  threshold  (Scholkopf  and 
Smola,  2002). 
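Given a set of support vectors and their multipliers, the bias and the decision function can be evaluated as sketched below. The sketch assumes the multipliers have already been found by some QP solver and that the bias is averaged over all supplied support vectors; the function names and the linear kernel are illustrative choices.

```python
def linear_kernel(a, b):
    """Plain inner product, standing in for any kernel K."""
    return sum(p * q for p, q in zip(a, b))

def svm_bias(sv, labels, alphas, kernel):
    """Average of y_i - sum_k alpha_k y_k K(x_k, x_i) over the support vectors."""
    residuals = [
        y_i - sum(a * y * kernel(x_k, x_i) for a, y, x_k in zip(alphas, labels, sv))
        for x_i, y_i in zip(sv, labels)
    ]
    return sum(residuals) / len(residuals)

def svm_decision(x, sv, labels, alphas, bias, kernel):
    """sign(f(x)) with f(x) = sum_k alpha_k y_k K(x_k, x) + b."""
    f = sum(a * y * kernel(x_k, x) for a, y, x_k in zip(alphas, labels, sv)) + bias
    return 1 if f >= 0 else -1
```

For the one-dimensional toy problem with support vectors at -1 and +1, labels -1 and +1, and multipliers of 0.5 each (the maximizer of the dual for this case), the bias evaluates to zero and points are labeled by their sign. Shifting the bias away from this value trades false positives against false negatives, as noted above.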


Figure  2.14.  SVM  with  Optimal  Hyperplane. 


Solving Equation (2.70) is a dual quadratic programming (QP) problem. There are several methods used to solve the quadratic programming problem, including Kernel-Adatron (Friess et al., 1998; Cristianini and Shawe-Taylor, 2004), LOQO (Vanderbei and Shanno, 1999) and sequential minimal optimization (SMO) (Cristianini and Shawe-Taylor, 2000; Franc and Hlavac, 2007; Mak, 2000; Platt, 2000).

Several solutions are available as complete SVM systems, including LIBSVM (Chang and Lin, 2001), Matlab Optimization Toolbox (2007) and SVMlight (Joachims, 1998, 2007). Each of these methods has individual advantages and disadvantages that are beyond the scope of this research.
2.7 Multi-Class Classification


In the previous section two-class classifiers were described. However, in many real world problems more than two classes exist. The classification methods EM, k-NN, KFD, Parzen window, probabilistic neural network and SVM can all be modified for a multi-class solution. EM can be used to determine the mean and covariance for each of the classes individually, with classification performed using the Bayes classifier. This, however, has the disadvantage of producing inaccurate results when the classes are not normally distributed or linearly separable. k-NN can also be trained to solve a multi-class problem. After selecting the k nearest neighbors of the input vector $x$, a count of the training samples from each of the classes can be used to determine the class label of $x$. The multi-class case performs better with a larger number of input training vectors $x$ but has the disadvantage of requiring the number of nearest neighbors to be determined. Unlike the two-class case, where better performance is achieved for large k, for multi-class classification this is not always true. The KFD is a two-class classifier by design. It could be converted into a multi-class system in a similar manner to the BSVM (Hsu et al., 2002). In this research only the two-class KFD will be used. For the Parzen window density estimator a multi-class solution can be achieved. As with the two-class case, this method is easily trained but computationally expensive. The expense is in terms of its processing time and memory allocation when the number of samples is large. For a multi-class solution, the larger the number of training samples per class, the better the performance. In the multi-class case of the SVM two methods are used in which the margins of separation are determined in the kernel space, BSVM (Hsu and Lin, 2002) and BSVM 2.0 (Hsu et al., 2002). BSVM 2.0 solves the multi-class classification problem for the solution of large classification and regression problems. It includes three methods:

•  Multi-class classification by solving a single optimization problem using a bound-constrained formulation.

•  Multi-class classification using Crammer and Singer's formulation (Crammer and Singer, 2000; Crammer and Singer, 2001).

•  Regression using a bound-constrained formulation.

While  each  of  these  methods  can  be  used  for  multi-class  classification  they  each  have 
disadvantages  when  compared  to  their  two-class  counterparts.  In  this  research  the  two- 
class  SVM  is  used  since  experimentation  has  shown  that  the  BSVM  2.0  begins  to  provide 
a  reduction  in  classification  accuracy  when  more  than  5  classes  are  used  for  the  clean  and 
stego  image  data  sets. 

In several multi-class classification methods two-class classifiers are combined using one-against-one and one-against-all (Fukunaga, 1990; Duda et al., 2001; Tax and Duin, 2002; Lin et al., 2003; Bishop, 2006; Theodoridis and Koutroumbas, 2006). Learning architectures are used to combine several two-class classifiers in order to create a multi-class classifier. In these methods training is done by comparing one class against each of the other classes or by training one class against the remaining classes. This produces several classifiers to which a winner-take-all approach is applied: the class label is assigned by majority vote. In this section the following multi-class approaches are presented: one-against-one and one-against-all methods.

2.7.1 One-Against-One

In one-against-one each class is trained against each of the others. The goal is to train the multi-class rule based on the majority vote strategy. The majority-vote multi-class classifier assigns the test input vector $x_0$ to the class in $C = [C_1, C_2, \ldots]$ having the majority of the votes. This is a fairly reliable method assuming that the feature space is separable from one class to the other. Problems arise when a large number of classes are being trained; the resulting system becomes computationally expensive because the number of classifiers grows quadratically with the number of classes. For $k$ classes, the one-against-one approach constructs $k(k-1)/2$ classifiers, each trained on data from two different classes, the $i$th and the $j$th. As an example consider a case with 10 classes, $k = 10$. This will require 45 classifiers to be trained. In most classification systems a voting strategy is used. Each binary classifier casts a vote for every data point $x$, and the class with the largest number of votes wins ("Max Wins"). This may lead to a situation where two classes have the same number of votes. One approach to resolving this conflict is to select the class with the smallest index (Hsu et al., 2002).
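The "Max Wins" vote with the smallest-index tie-break can be sketched as follows. The pairwise rules here are a hypothetical stand-in (nearest class mean on a one-dimensional feature) for whatever two-class classifiers are actually trained; `one_against_one` and `nearest_mean_rules` are illustrative names.

```python
from collections import Counter
from itertools import combinations

def one_against_one(x0, classes, pairwise):
    """pairwise[(a, b)] is a two-class rule that returns a or b for x0."""
    votes = Counter(pairwise[(a, b)](x0) for a, b in combinations(classes, 2))
    best = max(votes.values())
    # "Max Wins"; a tie is resolved by choosing the smallest class index.
    return min(c for c, v in votes.items() if v == best)

def nearest_mean_rules(means):
    """Build one toy pairwise rule per class pair from 1-D class means."""
    rules = {}
    for a, b in combinations(sorted(means), 2):
        rules[(a, b)] = (lambda x, a=a, b=b:
                         a if abs(x - means[a]) <= abs(x - means[b]) else b)
    return rules
```

For $k$ classes, `combinations(classes, 2)` yields $k(k-1)/2$ pairs, so ten classes require 45 pairwise rules.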

2.7.2 One-Against-All

Several articles have been written on one-against-all training methods (Liu and Zheng, 2005). The one-against-all method trains the multi-class problem as a series of two-class subtasks, one per class, that can be trained by any two-class classifier. If there are $k > 2$ classes, $k$ two-class classifiers are constructed, each separating one class from all other classes. To obtain the $k$ classifiers it is common to construct a set of binary classifiers, each trained to separate an individual class from the remaining classes. One disadvantage of this method is that with a significant number of classes a large number of two-class classifiers will need to be compared. When all of the other classes are grouped together the classification may become more difficult, as separating one class from all of the rest may not lead to a separation between the classes and may lead to poor classification performance.
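A one-against-all decision can be sketched in a single function. The sketch assumes each of the $k$ two-class classifiers exposes a real-valued score that is larger when the sample looks more like its own class; the scorer dictionary and the function name are hypothetical.

```python
def one_against_all(x0, scorers):
    """scorers[c](x0) is the 'class c versus the rest' score; the largest wins."""
    # Iterating in sorted order makes ties fall to the smallest class label.
    return max(sorted(scorers), key=lambda c: scorers[c](x0))
```

Only $k$ scorers are evaluated per sample, in contrast to the $k(k-1)/2$ pairwise rules of one-against-one.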

2.8  Classifier  Fusion 

To improve the classification accuracy for multi-class classification, combining classifiers (classifier fusion) may prove useful for the overall performance of the classification system. The main focus of recent research in classifier fusion has been on establishing the relationship between the diversity of the classifiers and their resulting accuracy/performance. The paradigms of the different models differ in their assumptions about classifier dependencies, type of classifier outputs, aggregation strategy (either global or local), aggregation procedure (such as a function, a neural network or an algorithm), etc. (Kittler et al., 1998; Duin and Tax, 2000; Ruta and Gabrys, 2000; Duin, 2002; Kittler, 2002). Three methods of combining classifiers are described: boosting, Bayes networks and probabilistic neural network combiners.

2.8.1  Boosting 

Boosting is a powerful technique for combining an ensemble of base classifiers to produce a form of committee whose performance can be significantly better than that of any of the single classifiers. The most widely used form of boosting is AdaBoost, developed by Freund and Schapire (1995). Boosting provides good results even if the base classifiers are weak learners whose performance is only slightly better than random (Freund and Schapire, 1999).

The primary difference between boosting and bagging is that the base classifiers are trained in sequence, and each base classifier is trained using a weighted form of the data set in which the weighting coefficient associated with each data point depends on the performance of the previous classifiers. In particular, points that are misclassified by one of the base classifiers are given greater weight when used to train the next classifier in the sequence. Once all the classifiers have been trained, their predictions are combined through a weighted majority voting scheme. AdaBoost calls a given weak or base learning algorithm repeatedly in a series of rounds, $k = 1, \ldots, K$. The precise form of the AdaBoost algorithm is given below:

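AdaBoost Algorithm (Bishop, 2006, p. 658)

1.  The data weighting coefficients $\{w_i\}$ are initialized as $w_i^{(1)} = 1/\ell$ for $i = 1, \ldots, \ell$.

2.  For $k = 1, \ldots, K$:

(a)  Fit a classifier $M_k(x)$ to the training data by minimizing the weighted error function

\[
J_k = \sum_{i=1}^{\ell} w_i^{(k)} I(M_k(x_i) \neq y_i) \tag{2.71}
\]

where $I(M_k(x_i) \neq y_i)$ is the indicator function and equals 1 when $M_k(x_i) \neq y_i$ and 0 otherwise.

(b)  Evaluate the quantities

\[
\varepsilon_k = \frac{\sum_{i=1}^{\ell} w_i^{(k)} I(M_k(x_i) \neq y_i)}{\sum_{i=1}^{\ell} w_i^{(k)}} \tag{2.72}
\]

and then use these to evaluate

\[
\alpha_k = \ln\left(\frac{1 - \varepsilon_k}{\varepsilon_k}\right) \tag{2.73}
\]

(c)  Update the data weighting coefficients

\[
w_i^{(k+1)} = w_i^{(k)} \exp\{\alpha_k I(M_k(x_i) \neq y_i)\} \tag{2.74}
\]

3.  Making a prediction using the final trained model for an input image sample $x_0$ is given by

\[
f(x_0) = \sum_{k=1}^{K} \alpha_k M_k(x_0) \tag{2.75}
\]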

The first base classifier $M_1(x)$ is trained using weighting coefficients $w_i^{(1)}$ that are all equal, which corresponds to the usual procedure for training a single classifier. In Step 2(c), in subsequent iterations the weighting coefficients $w_i^{(k)}$ are increased for data points that are misclassified and decreased for data points that are correctly classified. Successive classifiers are forced to place greater emphasis on points that have been misclassified by previous classifiers, and data points that continue to be misclassified by successive classifiers receive even greater weight. The quantities $\varepsilon_k$ represent weighted measures of the error rates of each of the base classifiers on the data set. Therefore, in Step 2(b) the weighting coefficients $\alpha_k$ give greater weight to the more accurate classifiers when computing the overall output given by Step 3 (Bishop, 2006, p. 658).
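The steps of the algorithm can be sketched with a very simple base learner. The sketch assumes one-dimensional features, labels in {-1, +1} and threshold "stumps" as the weak classifiers; the clamp on the error rate is an added guard against division by zero and is not part of the cited algorithm. All function names are hypothetical.

```python
import math

def fit_stump(xs, ys, weights):
    """Pick the threshold/polarity stump with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (+1, -1):
            preds = [polarity if x >= thr else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    _, thr, polarity = best
    return lambda x, thr=thr, polarity=polarity: polarity if x >= thr else -polarity

def adaboost(xs, ys, rounds):
    """Train `rounds` weighted stumps; returns a list of (alpha, model) pairs."""
    n = len(xs)
    weights = [1.0 / n] * n                                   # Step 1
    ensemble = []
    for _ in range(rounds):                                   # Step 2
        model = fit_stump(xs, ys, weights)                    # (a)
        miss = [1 if model(x) != y else 0 for x, y in zip(xs, ys)]
        eps = sum(w * m for w, m in zip(weights, miss)) / sum(weights)   # (b)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)               # guard (assumption)
        alpha = math.log((1.0 - eps) / eps)
        weights = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]  # (c)
        ensemble.append((alpha, model))
    return ensemble

def predict(ensemble, x0):
    """Step 3: sign of the alpha-weighted sum of the base classifier outputs."""
    score = sum(alpha * model(x0) for alpha, model in ensemble)
    return 1 if score >= 0 else -1
```

On data that a single stump separates perfectly, every round selects the same stump and the weights stay uniform; the reweighting only takes effect once a base classifier makes mistakes.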

2.8.2  Bayes  Network  for  Model  Averaging 

Bayes  model  averaging  merges  together  several  multi-class  classifiers  by  combining  the 
probabilistic  density  estimation  of  each  classifier’s  classification  accuracy  as  a  mixture  of 
Gaussians  (Hoeting  et  ah,  1999;  Murphy,  2001).  Murphy’s  (2001)  Bayes  Net  Toolbox 
(BNT)  for  Matlab  was  used  in  the  analysis  to  facilitate  the  computations  in  the  model 
averaging.  The  probabilistic  density  estimation  specifies  the  local  conditional  probability 
distributions  (CPD)  for  a  classification  model,  M*,  where  k  is  one  of  the  K  classifiers,  and 
M  is  the  set  of  all  classifiers.  The  CPD  of  each  model  M*  is  p{Mi\T).  This  represents  for 
each  class,  the  probability  of  what  a  classification  model  will  classify  a  target  instance  T 
as.  In  this  research  the  implementation  uses  confusion  matrices  which  represent  the 
correct  and  incorrect  classification  for  each  multi-class  classifier  providing  the 
probabilistic  density  estimation  for  each  classifier. 

The fusion process uses the classifications from the classification models ($M$), in conjunction with Bayes' rule, to compute the posterior probability for each target classification $T = c$:

$$p(T = c \mid M) = \eta \prod_{k=1}^{K} p(M_k \mid T = c)\, p(T = c) \qquad (2.76)$$

where $\eta$ is a normalizing constant. The final classification is then the target classification, $T = c$, with the highest probability. The prior probability $p(T)$ is calculated from the number of targets.
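
As an illustration of Equation (2.76), the following sketch fuses the labels of two classifiers using their confusion matrices. It is a simplified stand-in for the BNT-based implementation, not a reproduction of it; the function name `fuse_bayes`, the confusion counts and the uniform prior are all made up for the example.

```python
def fuse_bayes(confusions, votes, prior):
    """Posterior over classes given one label (vote) per classifier.

    confusions[k][c][m] is the number of class-c training samples that
    classifier k labelled as class m.
    """
    unnormalised = []
    for c in range(len(prior)):
        p = prior[c]
        for confusion, vote in zip(confusions, votes):
            # Row-normalised confusion entry estimates p(M_k = vote | T = c).
            p *= confusion[c][vote] / sum(confusion[c])
        unnormalised.append(p)
    eta = 1.0 / sum(unnormalised)  # normalising constant
    return [eta * p for p in unnormalised]

confusion_a = [[8, 2], [3, 7]]
confusion_b = [[9, 1], [4, 6]]
# Classifier A votes for class 0, classifier B votes for class 1.
posterior = fuse_bayes([confusion_a, confusion_b], [0, 1], [0.5, 0.5])
label = max(range(len(posterior)), key=posterior.__getitem__)
```

Here the fused label is class 1, because classifier B rarely outputs class 1 for true class 0 samples, which outweighs classifier A's vote.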

2.8.3  Probabilistic  Neural  Network  (PNN)  Fusion 

The fusion method in this work extends the two-class fusion investigated by Leap et al. (2007) to multi-class system fusion. In this method the outputs of the individual classification systems are treated as input features to train a probabilistic neural network (Specht, 1990) for fusion. The key is to use the class labels from each of the systems as posterior probability estimates and to employ them as features in the neural network. Because the posterior probabilities of a classifier sum to one, one of the posterior probabilities from each input classifier should be removed. For example, if $K$ three-class classifiers are used, then each of the classification models, $M_k$, contributes two inputs for training the PNN. The fusion method treats the posterior probabilities from the individual detection systems as features to the neural network and outputs an overall posterior probability of a sample being in a given class. This fusion does not impose any independence assumptions on the input systems.
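
A minimal sketch of this fusion step is given below, assuming a Gaussian Parzen kernel, equal class priors and a hand-picked smoothing parameter. The helper names `drop_last` and `pnn_posteriors`, the value of `sigma` and the toy posteriors are hypothetical and only illustrate the idea.

```python
import math

def drop_last(posteriors):
    """Concatenate classifier posteriors, dropping each redundant last entry."""
    return [p for post in posteriors for p in post[:-1]]

def pnn_posteriors(train_x, train_y, x, sigma=0.2):
    """Parzen-style PNN: average a Gaussian kernel over each class's patterns."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        dist2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        sums[yi] = sums.get(yi, 0.0) + math.exp(-dist2 / (2.0 * sigma ** 2))
        counts[yi] = counts.get(yi, 0) + 1
    density = {c: sums[c] / counts[c] for c in sums}
    total = sum(density.values())
    return {c: v / total for c, v in density.items()}

# Two three-class classifiers contribute two inputs each (K = 2).
train_x = [drop_last([[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]),
           drop_last([[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]),
           drop_last([[0.1, 0.1, 0.8], [0.1, 0.2, 0.7]])]
train_y = ["clean", "F5", "JSteg"]
query = drop_last([[0.75, 0.15, 0.10], [0.6, 0.3, 0.1]])
fused = pnn_posteriors(train_x, train_y, query)
label = max(fused, key=fused.get)
```

The query vector has four entries, matching the two inputs per three-class classifier mentioned in the text.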

2.9  Summary 

This chapter presented the key elements necessary to build the steganalysis multi-class classification system for identifying JPEG steganography embedding methods. JPEG image representation was described by introducing the discrete cosine transform and the JPEG image format and its compression steps. The feature generation methods described in this chapter were a wavelet-based method and a discrete cosine transform method. In feature preprocessing, outlier removal, data normalization and data standardization were presented. For feature extraction, PCA and kernel PCA were described. The feature ranking/selection methods presented in this chapter were the Bhattacharyya distance, Fisher’s linear discriminant ratio, signal-to-noise ratio, kernel Fisher’s discriminant recursive feature elimination and zero-norm feature ranking. In classification, both the two-class and multi-class classification methods used in this research were described; the six methods used are EM, k-NN, KFD, Parzen window, probabilistic neural networks and SVM. A section was devoted to improving classification performance with classifier fusion, covering boosting, Bayes networks and probabilistic neural networks.

In  this  chapter  several  methods  have  been  described  that  are  essential  in  making  a 
comparison  with  the  proposed  overall  detection  method  described  in  Chapter  3.  Some  of 
the  methods  described  in  this  chapter  are  modified  to  accommodate  the  needs  of  the 
proposed  method.  In  other  cases,  the  methods  in  this  chapter  are  incorporated  into  the 
system.

III. Methodology


This chapter presents a multi-class fusion system for classification of steganographic methods. This detection method classifies JPEG images based on generated image features, whereby each previously unseen image is associated with exactly one element of the label set, i.e., clean or a type of stego image. Each stego image is produced by one of nine targeted embedding methods: F5 (Westfeld, 2001; 2003), JP Hide (Latham, 1999), JSteg (Upham, 1993), Model-based (Sallee, 2003; 2006), Model-based Version 1.2 (Sallee, 2008a), OutGuess (Provos, 2004), Steganos (2008), StegHide (Hetzl, 2003) and UTSA (Agaian et al., 2006).

Figure 3.1 shows the classification system developed in this research. The image set consists of clean images and stego images that have data embedded using one of nine methods. Features are generated from each image and each feature set is assigned a class label identifying the embedding method used. The features are used in three components of the multi-class system. The first component is Multi-class Detection for EM/k-NN/Parzen/PNN. The existing feature improvement methods and classifiers are used to create four multi-class detection systems that each return a class label assigned to the input sample (Rodriguez and Peterson, 2008a). The second component is Multi-class Detection for KFD/SVM. It contains a new feature ranking method along with a new multi-class tree that generates a multi-class classification label from a combination of two-class classifiers. The third component, Commercial Detection Systems, has two commercial steganalysis tools that return class labels for a variety of stego methods. The class labels assigned by the eight multi-class systems are fused, shown as Classifier Fusion in the figure, and a final class label is assigned.

Figure 3.1. Detection System.


This  chapter  presents  four  improvements  to  steganalysis  pattern  recognition.  The  first  is 
the  creation  of  new  features  generated  from  the  frequency  bands  and  directions  of  the 
Discrete  Cosine  Transform  (DCT)  coefficients  of  JPEG  images.  The  second 
improvement  is  a  new  feature  ranking  method.  From  the  original  input  feature  set,  it 
selects a subset of features specifically designed for the kernel Fisher’s discriminant
(KFD) and the support vector machines (SVM). The third improvement is a multi-class
classification  tree  designed  for  the  KFD  and  SVM  classifiers.  The  final  contribution  of 
this  steganalysis  classification  system  is  the  fusion  of  multi-class  classifiers.  These 
improvements  are  designed  to  increase  the  identification  of  embedding  methods  used  to 
create  stego  images. 

3.1  Feature  Generation 

This  section  details  the  novel  DCT  feature  generation  method.  Figure  3.2  illustrates  the 
main  components  of  the  novel  feature  generation  method. 


Figure  3.2.  General  Feature  Generation  System. 

The first component builds on details of the DCT coefficient representation, which is used in a decomposition. Two metrics are calculated on each 8×8 block of the decomposed coefficients in a JPEG image. The first metric is a difference calculation that compares DCT coefficients with neighboring coefficients. The second metric is a least squares linear regression metric that uses DCT coefficients, shifted coefficients and neighboring coefficients to calculate weights used in the regression model. Statistics (e.g., mean, variance, etc.) are calculated over the DCT coefficients, neighboring coefficients, shifted coefficients and the metrics. The last three sets of statistics are then subtracted from the statistics of the DCT coefficients, creating a set of 180 features used to identify clean and stego images.


3.1.1 DCT Representation


The standard DCT used in JPEG compression has two properties, i.e., the directional and frequency distributions of 8×8 blocks within an image (Rao and Yip, 1990). In JPEG compression of a two-dimensional (2-D) signal, the zig-zag scan shown in Figure 3.3a is used to take advantage of the frequency distributions of the DCT shown in Figure 3.3b (Brown and Shepherd, 1995, p. 224). The DCT decomposition divides the coefficients into low, medium and high frequencies. Figure 3.3c shows the breakdown of the vertical, diagonal and horizontal directions of the coefficients. In this research both the frequencies and directions of the DCT are investigated to generate features. Figure 3.3d shows an 8×8 image with a horizontal edge between black and white pixels. The corresponding 2-D DCT of Figure 3.3d is shown in Figure 3.3g, which has coefficients that are prominent along the first column. Figure 3.3e shows an image with a diagonal edge between black and white pixels; its 2-D DCT, shown in Figure 3.3h, has coefficients located along the diagonal. Figure 3.3f shows an image with a vertical edge between black and white pixels; its 2-D DCT, shown in Figure 3.3i, has coefficients located along the first row.

Figure 3.3. DCT decomposition a) zig-zag scan pattern b) low, medium and high frequency distributions c) vertical, diagonal and horizontal directions d) 8×8 image with a horizontal edge between pixels e) 8×8 image with a diagonal edge between pixels f) 8×8 image with a vertical edge between pixels g) 2-D DCT representation of horizontal image h) 2-D DCT representation of diagonal image i) 2-D DCT representation of vertical image.

3.1.2 Arrangement of Decomposed DCT Coefficients


The calculation of the features requires arranging the DCT coefficients in three different ways. The first, DCT decomposition, separates the coefficients into areas within the 8×8 DCT block according to three frequency bands and three directions. This yields 9 areas, of which 6 are used. The second is a set of coefficients generated by shifting the 8×8 pixel blocks in the spatial domain and recalculating the quantized DCT coefficients. The DCT decomposition feature method is then applied to these shifted blocks. Three different shifting operations are used: shifting the 8×8 block to the right by four pixels (block shift right), down by four pixels (block shift down), and diagonally by four pixels (block shift diagonal). The last arrangement consists of sets of neighboring coefficients within an 8×8 DCT block for a DCT coefficient of interest.

3.1.2.1 Frequency and Directional Coefficient Vectors

The 8×8 coefficient values are represented as $d^b(u,v)$, where $u, v = 1,\ldots,8$ and $b = 1,\ldots,B$, where $B$ is the number of 8×8 blocks within a color layer of an image. The zig-zag pattern shown in Figure 3.3a is used to translate the 8×8 matrix into a vector. The vector is represented as $d^b_k$, $k = 1,\ldots,64$, and the locations of $k$ are shown in Figure 3.4.

Figure 3.4. DCT Coefficient Locations and Separations a) DCT Coefficient Location after Zig-Zag Scan b) Coefficient Locations of Vertical, Diagonal and Horizontal directions c) Coefficient Locations of Low, Mid and High Frequencies d) 8×8 block Coefficient Separation of both frequencies and directions.

The coefficient vector indices from the zig-zag method are shown in Figure 3.4a. The DC coefficient is at location 1 and locations 2 through 64 are the AC coefficients. Figure 3.4b shows the separations of the vertical (red), diagonal (green) and horizontal (blue) DCT decompositions. The remaining coefficients correspond to the high frequencies and are normally zero due to the quantization step of JPEG compression. For a typical compression of JPEG images the high frequencies correspond to the black cells in Figure 3.4c. The DCT decompositions of low (white), medium (gray) and high (black) frequency coefficients are shown in Figure 3.4c. In this research the coefficients are decomposed as shown in Figure 3.4d, where the 8×8 block is divided into eight DCT decompositions represented by both the frequency distributions and the directions.

The coefficients are arranged as follows:

• The combination of vertical and low frequencies (VL) is shown as red in Figure 3.4d. The vector $d^b_{VL}$ contains the DCT coefficients of block $b$ after the zig-zag scan at locations 2, 6, 7 and 8 such that the vector $D_{VL} = (d^b_{VL} \mid b = 1,\ldots,B)$.

• The diagonal and low frequencies (DL) are shown as green in Figure 3.4d. The vector $d^b_{DL}$ contains the DCT coefficients of block $b$ at locations 5 and 13 such that $D_{DL} = (d^b_{DL} \mid b = 1,\ldots,B)$.

• The horizontal and low frequencies (HL) are shown as blue in Figure 3.4d. The vector $d^b_{HL}$ contains the DCT coefficients of block $b$ at locations 3, 4, 9 and 10 such that $D_{HL} = (d^b_{HL} \mid b = 1,\ldots,B)$.

• The vertical and mid frequencies (VM) are shown as dark red in Figure 3.4d. The vector $d^b_{VM}$ contains the DCT coefficients of block $b$ after the zig-zag scan at locations 14, 15, 16, 17, 27, 28, 30 and 31 such that $D_{VM} = (d^b_{VM} \mid b = 1,\ldots,B)$.

• The diagonal and mid frequencies (DM) are shown as dark green in Figure 3.4d. The vector $d^b_{DM}$ contains the DCT coefficients of block $b$ after the zig-zag scan at locations 18, 19, 24, 25, 26, 32, 33, 39, 40 and 41 such that $D_{DM} = (d^b_{DM} \mid b = 1,\ldots,B)$.

• The horizontal and mid frequencies (HM) are shown as dark blue in Figure 3.4d. The vector $d^b_{HM}$ contains the DCT coefficients of block $b$ after the zig-zag scan at locations 11, 12, 20, 21, 22, 23, 34 and 35 such that $D_{HM} = (d^b_{HM} \mid b = 1,\ldots,B)$.

The remaining coefficients of Figure 3.4d, shown in black, are not analyzed with the decomposition since during JPEG compression they are often zero valued and typically not used to hide a stego message (Fridrich, 2004).
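
The decomposition above can be sketched as follows. The sketch assumes that Figure 3.4a uses the standard JPEG zig-zag order (location 2 is the coefficient to the right of the DC term), which is consistent with the vertical areas lying along the first row. The names `zigzag_order`, `AREAS` and `decompose` are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in JPEG zig-zag order for an n x n block."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked upwards, odd ones downwards.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order

# 1-based zig-zag locations of the six analysed areas, taken from the list above.
AREAS = {
    "VL": (2, 6, 7, 8),
    "DL": (5, 13),
    "HL": (3, 4, 9, 10),
    "VM": (14, 15, 16, 17, 27, 28, 30, 31),
    "DM": (18, 19, 24, 25, 26, 32, 33, 39, 40, 41),
    "HM": (11, 12, 20, 21, 22, 23, 34, 35),
}

def decompose(block):
    """Split one 8x8 coefficient block into the six area vectors."""
    vector = [block[r][c] for r, c in zigzag_order()]
    return {name: [vector[k - 1] for k in locs] for name, locs in AREAS.items()}

# Toy block whose entries encode their own position as 10 * row + col.
block = [[10 * r + c for c in range(8)] for r in range(8)]
areas = decompose(block)
```

With the toy block, each extracted value reveals the (row, column) position it came from, which makes it easy to check the area layout against Figure 3.4d.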

3.1.2.2 Block Shifted Coefficient Vectors

In this subsection the individual 8×8 blocks of an input JPEG image are shifted in the spatial domain and recompressed using the JPEG compression technique. Three shifting techniques are used: shifting to the right, shifting down, and shifting both right and down, each by four pixels. The coefficients from the shifted blocks are placed in vectors as in subsection 3.1.2.1.

The first set of shifted coefficients focuses on shifting the pixel values to the right by four pixels in the spatial domain as shown in Figure 3.5.

Figure 3.5. Original block and right shifted pixel locations

The original block containing the spatial domain pixels is transformed using the JPEG compression properties, e.g., the same quantization table used in compression. The last column of blocks has no neighboring blocks, so the final four columns of pixels in the image are duplicated to ensure $B$ shifted blocks exist.

Using the same vector representation of the DCT coefficients as in subsection 3.1.2.1, the right shifted blocks result in the following vector representations:

• The combination of vertical and low frequencies (VL) results in a vector $s^b_{VL,Right}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 2, 6, 7 and 8 such that the vector $S_{VL,Right} = (s^b_{VL,Right} \mid b = 1,\ldots,B)$.

• The diagonal and low frequencies (DL) result in a vector $s^b_{DL,Right}$ containing the DCT coefficients of block $b$ at locations 5 and 13 such that $S_{DL,Right} = (s^b_{DL,Right} \mid b = 1,\ldots,B)$.

• The horizontal and low frequencies (HL) result in a vector $s^b_{HL,Right}$ containing the DCT coefficients of block $b$ at locations 3, 4, 9 and 10 such that $S_{HL,Right} = (s^b_{HL,Right} \mid b = 1,\ldots,B)$.

• The vertical and mid frequencies (VM) result in a vector $s^b_{VM,Right}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 14, 15, 16, 17, 27, 28, 30 and 31 such that $S_{VM,Right} = (s^b_{VM,Right} \mid b = 1,\ldots,B)$.

• The diagonal and mid frequencies (DM) result in a vector $s^b_{DM,Right}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 18, 19, 24, 25, 26, 32, 33, 39, 40 and 41 such that $S_{DM,Right} = (s^b_{DM,Right} \mid b = 1,\ldots,B)$.

• The horizontal and mid frequencies (HM) result in a vector $s^b_{HM,Right}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 11, 12, 20, 21, 22, 23, 34 and 35 such that $S_{HM,Right} = (s^b_{HM,Right} \mid b = 1,\ldots,B)$.

The second set of shifted coefficients focuses on shifting the pixel values down by four pixels in the spatial domain as shown in Figure 3.6.

Figure 3.6. Original block and down shifted pixel locations

For this method the last row of blocks has no neighboring blocks, so the final four rows of pixels in the image are duplicated to ensure $B$ shifted blocks exist.

The down shifted blocks result in the following vector representations:

• The combination of vertical and low frequencies (VL) results in a vector $s^b_{VL,Down}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 2, 6, 7 and 8 such that the vector $S_{VL,Down} = (s^b_{VL,Down} \mid b = 1,\ldots,B)$.

• The diagonal and low frequencies (DL) result in a vector $s^b_{DL,Down}$ containing the DCT coefficients of block $b$ at locations 5 and 13 such that $S_{DL,Down} = (s^b_{DL,Down} \mid b = 1,\ldots,B)$.

• The horizontal and low frequencies (HL) result in a vector $s^b_{HL,Down}$ containing the DCT coefficients of block $b$ at locations 3, 4, 9 and 10 such that $S_{HL,Down} = (s^b_{HL,Down} \mid b = 1,\ldots,B)$.

• The vertical and mid frequencies (VM) result in a vector $s^b_{VM,Down}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 14, 15, 16, 17, 27, 28, 30 and 31 such that $S_{VM,Down} = (s^b_{VM,Down} \mid b = 1,\ldots,B)$.

• The diagonal and mid frequencies (DM) result in a vector $s^b_{DM,Down}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 18, 19, 24, 25, 26, 32, 33, 39, 40 and 41 such that $S_{DM,Down} = (s^b_{DM,Down} \mid b = 1,\ldots,B)$.

• The horizontal and mid frequencies (HM) result in a vector $s^b_{HM,Down}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 11, 12, 20, 21, 22, 23, 34 and 35 such that $S_{HM,Down} = (s^b_{HM,Down} \mid b = 1,\ldots,B)$.

The third set of shifted coefficients focuses on shifting the pixel values to the right by four pixels and down by four pixels in the spatial domain as shown in Figure 3.7.

When shifting the blocks diagonally, the last row and column of blocks have no neighboring blocks, so the final four rows and the final four columns of pixels in the image are duplicated to ensure $B$ diagonally shifted blocks exist.

The diagonally shifted blocks result in the following vector representations:

• The combination of vertical and low frequencies (VL) results in a vector $s^b_{VL,Diag}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 2, 6, 7 and 8 such that the vector $S_{VL,Diag} = (s^b_{VL,Diag} \mid b = 1,\ldots,B)$.

• The diagonal and low frequencies (DL) result in a vector $s^b_{DL,Diag}$ containing the DCT coefficients of block $b$ at locations 5 and 13 such that $S_{DL,Diag} = (s^b_{DL,Diag} \mid b = 1,\ldots,B)$.

• The horizontal and low frequencies (HL) result in a vector $s^b_{HL,Diag}$ containing the DCT coefficients of block $b$ at locations 3, 4, 9 and 10 such that $S_{HL,Diag} = (s^b_{HL,Diag} \mid b = 1,\ldots,B)$.

• The vertical and mid frequencies (VM) result in a vector $s^b_{VM,Diag}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 14, 15, 16, 17, 27, 28, 30 and 31 such that $S_{VM,Diag} = (s^b_{VM,Diag} \mid b = 1,\ldots,B)$.

• The diagonal and mid frequencies (DM) result in a vector $s^b_{DM,Diag}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 18, 19, 24, 25, 26, 32, 33, 39, 40 and 41 such that $S_{DM,Diag} = (s^b_{DM,Diag} \mid b = 1,\ldots,B)$.

• The horizontal and mid frequencies (HM) result in a vector $s^b_{HM,Diag}$ containing the DCT coefficients of block $b$ after the zig-zag scan at locations 11, 12, 20, 21, 22, 23, 34 and 35 such that $S_{HM,Diag} = (s^b_{HM,Diag} \mid b = 1,\ldots,B)$.
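
The spatial shifting with edge duplication can be sketched as below. Only the pixel rearrangement is shown; the DCT and quantization of each shifted block are omitted, and the helper names `shift_image` and `blocks_8x8` are hypothetical.

```python
def shift_image(image, right=0, down=0):
    """Shift the 8x8 block grid by `right` and `down` pixels.

    Pixels that would fall outside the image are filled by repeating the
    final columns or rows, so the number of blocks stays the same.
    """
    height, width = len(image), len(image[0])

    def source(index, shift, size):
        return index + shift if index + shift < size else index

    return [[image[source(r, down, height)][source(c, right, width)]
             for c in range(width)]
            for r in range(height)]

def blocks_8x8(image):
    """Split an image (list of rows) into 8x8 blocks in raster order."""
    return [[row[c:c + 8] for row in image[r:r + 8]]
            for r in range(0, len(image), 8)
            for c in range(0, len(image[0]), 8)]

# Toy 8 x 16 image whose pixel values equal their column index.
image = [[c for c in range(16)] for _ in range(8)]
right_blocks = blocks_8x8(shift_image(image, right=4))
```

In the toy image the second shifted block begins with columns 12 to 15 and then repeats them, which mirrors the duplication of the final four columns described for the right shift.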

3.1.2.3 Neighboring Coefficient Matrices

Each DCT coefficient has a corresponding vector of neighboring coefficients. For a coefficient of interest in an 8×8 block, the neighboring coefficients are defined as its surrounding coefficients. The six vectors representing the directional and frequency coefficients described in subsection 3.1.2.1 each have a matrix of neighboring coefficients.

The vectors and matrices of neighboring coefficients are as follows:

• For the vertical direction and low frequencies vector $d^b_{VL}$, when $VL = 2$ the coefficient at location $d^b_2$ has corresponding neighboring coefficients 1, 6, 7, 8 and 14, represented by the vector $n^b_{2,k_{VL}} = [1\ 6\ 7\ 8\ 14]$, $k_{VL} = 1,\ldots,5$. The matrix of neighboring coefficients for $d^b_{VL}$ is as follows:

$$n^b_{VL,k_{VL}} = \begin{bmatrix} n^b_{2,k_{VL}} \\ n^b_{6,k_{VL}} \\ n^b_{7,k_{VL}} \\ n^b_{8,k_{VL}} \end{bmatrix}, \qquad n^b_{2,k_{VL}} = [1\ 6\ 7\ 8\ 14], \quad n^b_{8,k_{VL}} = [5\ 14\ 17\ 18\ 26]$$

such that $N_{VL} = (n^b_{VL,k_{VL}} \mid b = 1,\ldots,B)$.

• The matrix of neighboring coefficients for the diagonal direction and low frequencies vector $d^b_{DL}$ is represented as follows:

$$n^b_{DL,k_{DL}} = \begin{bmatrix} n^b_{5,k_{DL}} \\ n^b_{13,k_{DL}} \end{bmatrix} = \begin{bmatrix} 1 & 8 & 9 & 13 & 25 \\ 5 & 18 & 19 & 25 & 40 \end{bmatrix}$$

such that $N_{DL} = (n^b_{DL,k_{DL}} \mid b = 1,\ldots,B)$.

• The matrix of neighboring coefficients for the horizontal direction and low frequencies vector $d^b_{HL}$ is represented as $n^b_{HL,k_{HL}}$ such that $N_{HL} = (n^b_{HL,k_{HL}} \mid b = 1,\ldots,B)$.

• The matrix of neighboring coefficients for the vertical direction and medium frequencies vector $d^b_{VM}$ is represented as $n^b_{VM,k_{VM}}$ such that $N_{VM} = (n^b_{VM,k_{VM}} \mid b = 1,\ldots,B)$.

• The matrix of neighboring coefficients for the diagonal direction and medium frequencies vector $d^b_{DM}$ is represented as $n^b_{DM,k_{DM}}$ such that $N_{DM} = (n^b_{DM,k_{DM}} \mid b = 1,\ldots,B)$.

• The matrix of neighboring coefficients for the horizontal direction and medium frequencies vector $d^b_{HM}$ is represented as $n^b_{HM,k_{HM}}$ such that $N_{HM} = (n^b_{HM,k_{HM}} \mid b = 1,\ldots,B)$.

The arrangement of the coefficients into the vectors $D$, $S_{Right}$, $S_{Down}$ and $S_{Diag}$, along with the matrices $N$, is used to calculate the metrics in the next subsection and to calculate the statistics necessary for generating the features.

3.1.3 Metrics Calculation


In this subsection two metrics used to compare coefficients are described. The first is a difference calculation that compares DCT coefficients with neighboring coefficients. The second metric is a least squares linear regression metric that uses DCT coefficients, shifted coefficients and neighboring coefficients to calculate weights used in the regression model.


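3.1.3.1 Mean Difference between DCT Coefficients and Neighboring Coefficients

The mean difference metric between the DCT coefficients in subsection 3.1.2.1 and the neighboring coefficients from subsection 3.1.2.3 is described in this subsection. Vectors are generated for the three directions and the frequencies.

The mean differences are calculated as follows:

• Vertical direction and low frequencies

$$\bar{d}^b_{VL} = \frac{1}{5} \sum_{k_{VL}=1}^{5} \left( d^b_{VL} - n^b_{VL,k_{VL}} \right) \text{ such that } \bar{D}_{VL} = (\bar{d}^b_{VL} \mid b = 1,\ldots,B) \qquad (3.1)$$

• Diagonal direction and low frequencies

$$\bar{d}^b_{DL} = \frac{1}{5} \sum_{k_{DL}=1}^{5} \left( d^b_{DL} - n^b_{DL,k_{DL}} \right) \text{ such that } \bar{D}_{DL} = (\bar{d}^b_{DL} \mid b = 1,\ldots,B) \qquad (3.2)$$

• Horizontal direction and low frequencies

$$\bar{d}^b_{HL} = \frac{1}{5} \sum_{k_{HL}=1}^{5} \left( d^b_{HL} - n^b_{HL,k_{HL}} \right) \text{ such that } \bar{D}_{HL} = (\bar{d}^b_{HL} \mid b = 1,\ldots,B) \qquad (3.3)$$

• Vertical direction and medium frequencies

$$\bar{d}^b_{VM} = \frac{1}{5} \sum_{k_{VM}=1}^{5} \left( d^b_{VM} - n^b_{VM,k_{VM}} \right) \text{ such that } \bar{D}_{VM} = (\bar{d}^b_{VM} \mid b = 1,\ldots,B) \qquad (3.4)$$

• Diagonal direction and medium frequencies

$$\bar{d}^b_{DM} = \frac{1}{5} \sum_{k_{DM}=1}^{5} \left( d^b_{DM} - n^b_{DM,k_{DM}} \right) \text{ such that } \bar{D}_{DM} = (\bar{d}^b_{DM} \mid b = 1,\ldots,B) \qquad (3.5)$$

• Horizontal direction and medium frequencies

$$\bar{d}^b_{HM} = \frac{1}{5} \sum_{k_{HM}=1}^{5} \left( d^b_{HM} - n^b_{HM,k_{HM}} \right) \text{ such that } \bar{D}_{HM} = (\bar{d}^b_{HM} \mid b = 1,\ldots,B) \qquad (3.6)$$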
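
As a worked illustration of the mean difference metric, the sketch below evaluates it for the diagonal, low frequency area, using the neighbor locations listed for locations 5 and 13 in subsection 3.1.2.3. The names `DL_NEIGHBORS` and `mean_differences` and the toy coefficient vector are hypothetical.

```python
# Neighbor locations (1-based zig-zag indices) for the two DL coefficients.
DL_NEIGHBORS = {5: (1, 8, 9, 13, 25), 13: (5, 18, 19, 25, 40)}

def mean_differences(zigzag_vector, neighbors):
    """Mean of (coefficient - neighbor) over the neighbors of each location."""
    result = []
    for location, neighbor_locations in neighbors.items():
        d = zigzag_vector[location - 1]
        diffs = [d - zigzag_vector[k - 1] for k in neighbor_locations]
        result.append(sum(diffs) / len(diffs))
    return result

# Toy block in which the coefficient at location k simply has the value k.
toy_vector = list(range(1, 65))
d_bar_dl = mean_differences(toy_vector, DL_NEIGHBORS)
```

For the toy vector the two mean differences are -6.2 and -8.4, i.e., both coefficients are smaller on average than their neighbors.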


3.1.3.2 Least Squares Linear Regression

Regression analysis is used to assess the relationship between dependent variables and one or more independent variables. The independent variables are also known as predictor variables. To avoid confusion in this chapter, the independent (predictor) variables are the neighboring and shifted coefficients, while the dependent variables are the DCT coefficients. The coefficients in this section are used to calculate the least squares linear regression metric (Legendre, 1805; Gauss, 1809, pp. 205-224; Davis, 1809/1857, pp. 249-273; Dillon and Goldstein, 1984, pp. 209-250; Draper and Smith, 1998; Neter et al., 1996). The idea is to predict the mean value of the dependent variables (in this case the DCT coefficients) on the basis of the fixed neighboring coefficients and shifted coefficients. The regression model with multiple variables in $N$ is written as

$$D = \beta_0 + \beta_1 N_1 + \beta_2 N_2 + \cdots \qquad (3.7)$$

where $\beta_0$ is referred to as the intercept coefficient and the remaining $\beta$s are the slope coefficients, which give the change in $D$ with respect to $N$. The $\beta$s are calculated as

$$\beta = [N^T N]^{-1} N^T D \qquad (3.8)$$

The intercept coefficient $\beta_0$ in Equation (3.7) cannot be calculated using Equation (3.8) for $D$ and $N$ alone. To solve this problem a column vector of 1's is added to the front of the matrix $N$. The column of 1's allows the regression model to contain the term $\beta_0$. If the $\beta_0$ term is omitted from the regression model, the response of the model is zero when all of the predictor variables are zero. In a straight line regression model the line has a zero intercept when $\beta_0 = 0$, resulting in a poor model (Draper and Smith, 1998).
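
The role of the column of 1's can be illustrated with a small pure-Python sketch of the normal equations. The helper names `solve` and `fit_with_intercept` and the toy data are hypothetical; the sketch assumes the predictor matrix has full column rank so that the normal equations have a unique solution.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_with_intercept(predictors, response):
    """Return (beta, fitted) for the model with a leading column of 1's."""
    rows = [[1.0] + list(p) for p in predictors]
    width = len(rows[0])
    # Normal equations: (N^T N) beta = N^T D.
    ntn = [[sum(r[i] * r[j] for r in rows) for j in range(width)]
           for i in range(width)]
    ntd = [sum(r[i] * d for r, d in zip(rows, response)) for i in range(width)]
    beta = solve(ntn, ntd)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return beta, fitted

# Toy data generated from D = 2 + 3 * N1 - 1 * N2.
predictors = [[0, 1], [1, 0], [2, 3], [3, 1]]
response = [1, 5, 5, 10]
beta, fitted = fit_with_intercept(predictors, response)
```

The first entry of `beta` recovers the intercept of 2, which would be forced to zero if the column of 1's were left out.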


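The vectors for the regression metric in this subsection for the three directions and frequencies are calculated as follows, where $\tilde{N}$ denotes the neighboring coefficient matrix augmented with a column of 1's and the shifted coefficients:

• Vertical direction and low frequencies

$$\tilde{N}_{VL} = [\mathbf{1}\ \ N_{VL}\ \ S_{VL,Right}\ \ S_{VL,Down}\ \ S_{VL,Diag}]$$
$$\beta_{VL} = (\tilde{N}_{VL}^T \tilde{N}_{VL})^{-1} \tilde{N}_{VL}^T D_{VL}$$
$$\hat{D}_{VL} = \tilde{N}_{VL} \beta_{VL}$$

• Diagonal direction and low frequencies

$$\tilde{N}_{DL} = [\mathbf{1}\ \ N_{DL}\ \ S_{DL,Right}\ \ S_{DL,Down}\ \ S_{DL,Diag}]$$
$$\beta_{DL} = (\tilde{N}_{DL}^T \tilde{N}_{DL})^{-1} \tilde{N}_{DL}^T D_{DL}$$
$$\hat{D}_{DL} = \tilde{N}_{DL} \beta_{DL}$$

• Horizontal direction and low frequencies

$$\tilde{N}_{HL} = [\mathbf{1}\ \ N_{HL}\ \ S_{HL,Right}\ \ S_{HL,Down}\ \ S_{HL,Diag}]$$
$$\beta_{HL} = (\tilde{N}_{HL}^T \tilde{N}_{HL})^{-1} \tilde{N}_{HL}^T D_{HL}$$
$$\hat{D}_{HL} = \tilde{N}_{HL} \beta_{HL}$$

• Vertical direction and medium frequencies

$$\tilde{N}_{VM} = [\mathbf{1}\ \ N_{VM}\ \ S_{VM,Right}\ \ S_{VM,Down}\ \ S_{VM,Diag}]$$
$$\beta_{VM} = (\tilde{N}_{VM}^T \tilde{N}_{VM})^{-1} \tilde{N}_{VM}^T D_{VM}$$
$$\hat{D}_{VM} = \tilde{N}_{VM} \beta_{VM}$$

• Diagonal direction and medium frequencies

$$\tilde{N}_{DM} = [\mathbf{1}\ \ N_{DM}\ \ S_{DM,Right}\ \ S_{DM,Down}\ \ S_{DM,Diag}]$$
$$\beta_{DM} = (\tilde{N}_{DM}^T \tilde{N}_{DM})^{-1} \tilde{N}_{DM}^T D_{DM}$$
$$\hat{D}_{DM} = \tilde{N}_{DM} \beta_{DM}$$

• Horizontal direction and medium frequencies

$$\tilde{N}_{HM} = [\mathbf{1}\ \ N_{HM}\ \ S_{HM,Right}\ \ S_{HM,Down}\ \ S_{HM,Diag}]$$
$$\beta_{HM} = (\tilde{N}_{HM}^T \tilde{N}_{HM})^{-1} \tilde{N}_{HM}^T D_{HM}$$
$$\hat{D}_{HM} = \tilde{N}_{HM} \beta_{HM}$$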


For  each  of  the  coefficients  investigated  in  F  a  set  of  neighboring  coefficients  were 
selected  based  on  experimental  analysis  and  an  understanding  of  both  JPEG  compression 
and  how  the  embedding  methods  alter  the  coefficients.  Determining  the  number  of 
neighboring  coefficients  can  be  expanded  to  sequential  selection  used  in  regression,  e.g., 
backward  selection,  forward  selection  and  stepwise  selection  (Dillon  and  Goldstein, 
1984). 

3.1.4  Statistics  Calculation 

By using the metrics derived in the previous subsection, the statistics are calculated over the vectors in subsections 3.1.2 and 3.1.3 in order to generate the features. Table 3.1 lists five statistics: mean, standard deviation, skewness, kurtosis, and entropy, along with their calculation.

Table 3.1. Test statistics for generating features.

Test Statistic        Statistical Function F(·)
Mean                  $F_\mu(D) = \mu(D) = \frac{1}{n}\sum_{i=1}^{n} D_i$
Standard Deviation    $F_\sigma(D) = \sigma(D) = \left(\frac{1}{n}\sum_{i=1}^{n}\left(D_i - \mu(D)\right)^2\right)^{1/2}$
Skewness              $F_\gamma(D) = \gamma(D) = \frac{\sum_{i=1}^{n}\left(D_i - \mu(D)\right)^3}{(n-1)\,\sigma(D)^3}$
Kurtosis              $F_\kappa(D) = \kappa(D) = \frac{\sum_{i=1}^{n}\left(D_i - \mu(D)\right)^4}{(n-1)\,\sigma(D)^4}$
Entropy               $F_E(D) = E(D) = -\sum_{i=1}^{n} D_i \log(D_i)$
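
The five statistics can be sketched in a few lines. The normalisation constants below (division by n for the standard deviation and by n - 1 for skewness and kurtosis) are assumptions, as is the choice to compute the entropy over the empirical distribution of the values so that the logarithm is always defined. The function name `five_statistics` is hypothetical, and the input is assumed to be non-constant.

```python
import math
from collections import Counter

def five_statistics(values):
    """Return [mean, standard deviation, skewness, kurtosis, entropy]."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skewness = sum((v - mean) ** 3 for v in values) / ((n - 1) * std ** 3)
    kurtosis = sum((v - mean) ** 4 for v in values) / ((n - 1) * std ** 4)
    # Assumption: entropy of the empirical distribution of the values.
    probabilities = [count / n for count in Counter(values).values()]
    entropy = -sum(p * math.log(p) for p in probabilities)
    return [mean, std, skewness, kurtosis, entropy]

stats = five_statistics([1, 2, 3, 4, 5])
```

Each area vector and each metric vector would be passed through the same function, giving five numbers per vector.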

3.1.5  Features 

The new feature generation method produces a total of 180 features for an input image. The features are formed by taking differences between the calculated statistics, so the number of features depends on the DCT decomposition and the selected coefficients as described in subsections 3.1.2 through 3.1.4. $D$ includes the coefficient vectors in 3.1.2.1, $\hat{D}$ is the regression model described in 3.1.3.2, $\bar{D}$ are the mean differences in 3.1.3.1, $\bar{N}$ contains the average of the neighboring coefficients in 3.1.2.3, $S_{Right}$, $S_{Down}$ and $S_{Diag}$ are the block shifted coefficient vectors in 3.1.2.2, and the statistical calculation functions $F(\cdot)$ are described in 3.1.4. Each difference below contributes 5 statistics for each of the 6 decomposition areas, i.e., 30 features.

$F(D) - F(\hat{D})$ generates 30 features
$F(D) - F(\bar{D})$ generates 30 features
$F(D) - F(\bar{N})$ generates 30 features
$F(D) - F(S_{Right})$ generates 30 features
$F(D) - F(S_{Down})$ generates 30 features
$F(D) - F(S_{Diag})$ generates 30 features

These are denoted as raw features. The three detection systems described in the following sections take these features as inputs.

3.2  Feature  Ranking/Selection 

The  previous  section  presents  a  feature  generation  method  that  results  in  180  features  that 
identify  the  difference  between  clean  and  stego  images.  Some  of  these  features  separate 
the  clean  from  stego  images  better  than  others.  In  this  section  a  new  feature  ranking 
method  for  two-class  kernel  Fisher’s  discriminant  and  support  vector  machines  classifiers 
is described that identifies the best features to use for accurate classification (Rodriguez et al., 2008a).

3.2.1  SVM-Kernel  Feature  Ranking  (KFR) 

SVM-KFR consists of a three-step feature ranking strategy to choose representative features and remove noisy features for a data set with multiple features. The first step is to remove one feature at a time from the training data set. Specifically, feature m is removed from $x_i$, denoted as $x_i^{(m)}$, where (m) indicates the removed feature m. The second step is to solve Equation (2.70) to identify the support vectors, $x_k$, and the non-negative alpha vectors, $C \ge \alpha_i \ge 0$. Once the support vectors are identified the kernel matrix is calculated as:


(3.18)

The final step multiplies the kernel matrix with feature m removed, $K^{(m)}(x_k^{(m)}, x_j^{(m)})$, by the alpha vectors, $\alpha^{(m)}$, and the associated class labels y. By rewriting Equation (2.71) the multiplication results in the following solution:

$$f\left(x_j^{(m)}\right) = \sum_{k=1}^{\ell} \alpha_k^{(m)} y_k K^{(m)}\left(x_k^{(m)}, x_j^{(m)}\right) + b. \qquad (3.19)$$

This projection results in approximated class labels without the bias shown in Equation (2.70). If a feature with strong class separability is removed, an incorrect estimate results. As an example, consider a nonlinearly separable set of 50 samples with 100 features and an equal number of samples in each class. Figure 3.8 shows the mixture of classes when a strongly separating feature is removed. The x-axis in Figure 3.8 represents the index, j, of sample $x_j^{(m)}$ and the y-axis represents the predicted class value, $f(x_j^{(m)})$, for each sample after calculating Equation (3.19), where the alpha values are in the range $0 \le \alpha_k^{(m)} \le 6$. The range is determined by the upper bound C when solving Equation (2.70).

Figure 3.8. One dimensional mapping of Equation (3.19) when a strongly separating feature is removed.

On the other hand, if a feature that is noise-like in terms of class separability is removed, the two classes show a separation. Figure 3.9 shows the result of removing a weak ranked feature.

Figure 3.9. One dimensional mapping of Equation (3.19) when the weak ranked feature is removed.


For ranking purposes, the projection of the samples $f(x_j^{(m)})$ is summed. A problem arises when positive and negative values are summed, resulting in potential cancellation of the results. Because of this, the labels $y_k$ in Equation (3.19) are excluded from the decision function as

$$f\left(x_j^{(m)}\right) = \sum_{k=1}^{\ell} \alpha_k^{(m)} K^{(m)}\left(x_k^{(m)}, x_j^{(m)}\right) + b. \qquad (3.20)$$

The solution for the ranking can be defined as the summation of Equation (3.20), resulting in a ranking value for feature m as follows:

$$R_m = \sum_{j=1}^{\ell}\left(\sum_{k=1}^{\ell} \alpha_k^{(m)} K^{(m)}\left(x_k^{(m)}, x_j^{(m)}\right) + b\right) \qquad (3.21)$$

where $\alpha_k^{(m)}$ contains the weights for the support vectors. It is important to note that only the support vectors are used to calculate $K^{(m)}$ during the ranking process, implying that $x_k^{(m)} = x_j^{(m)}$. A normalizing factor of $1/\ell$ can be applied to the ranking values $R_m$ for the feature ranking criterion but is not required. The algorithm of this method is provided as follows:

1. For each of the n features perform steps 2, 3 and 4.

2. Remove the current feature m from the data set $x_i$ and train the SVM model, extracting the α-vectors and the support vectors $x_k$.

3. Calculate $K^{(m)}$ using the support vectors $x_k$ from step 2.

4. Assign a ranking value $R_m$ according to Equation (3.21) and restore feature m to the data set.

5. After completion of the loop, sort the ranking values $R_m$ in descending order.

6. Select the r highest ranked features for training the SVM classification model.
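The loop above can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation used in this research: `train_stand_in` is a hypothetical kernel-perceptron placeholder for the SVM solver of Equation (2.70), used only because it also returns non-negative weights bounded by C and a bias; the RBF kernel, its `gamma` value and all function names are likewise assumptions.

```python
import math

def rbf(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def train_stand_in(X, y, C=6.0, epochs=20):
    """Hypothetical stand-in for an SVM solver (kernel perceptron).

    Returns non-negative weights alpha (capped at C) and a bias b.
    """
    n = len(X)
    alpha, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for j in range(n):
            f = sum(alpha[k] * y[k] * rbf(X[k], X[j]) for k in range(n)) + b
            if y[j] * f <= 0:          # misclassified: strengthen sample j
                alpha[j] = min(C, alpha[j] + 1.0)
                b += y[j]
    return alpha, b

def kfr_rank(X, y, train=train_stand_in):
    """Return (R_m, m) pairs sorted by ranking value, largest first."""
    scores = []
    for m in range(len(X[0])):
        Xm = [row[:m] + row[m + 1:] for row in X]      # remove feature m
        alpha, b = train(Xm, y)
        support = [k for k, a in enumerate(alpha) if a > 0]
        # Equation (3.21): sum the label-free decision values over all samples.
        r = sum(sum(alpha[k] * rbf(Xm[k], xj) for k in support) + b
                for xj in Xm)
        scores.append((r, m))                          # feature m is restored
    scores.sort(key=lambda t: t[0], reverse=True)
    return scores
```

Each sample is expected as a Python list so that a feature can be dropped by slicing. The r highest ranked features are then the first r entries of the returned list.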

Equation (3.20) estimates the effect of the optimization solution in Equation (2.70) by removing one feature at a time. The summation of the mapping $\alpha_k^{(m)} K^{(m)}(x_k^{(m)}, x_j^{(m)})$ in Equation (3.21) seeks to maximize the distance between classes, C = {-1, +1}. To explain the mathematical representation of the ranking criterion in Equation (3.21), it is necessary to re-examine f(x) from Equation (2.71), which denotes the solution for classification determined by the values of the vector α and the bias b at a particular stage of the learning. Letting

$$E_j = f(x_j) - y_j = \left(\sum_{k=1}^{\ell} \alpha_k y_k K(x_k, x_j) + b\right) - y_j \qquad (3.22)$$

be the error difference between the function output and the target value (Cristianini and Shawe-Taylor, 2000) on the training data x, it is possible to show the relationship between Equations (2.71) and (3.21). For an ideal case the desired value of $E_j$ would be 0. The goal is to retain the features that approximate the sum as follows

(3.23) 

j= i  v  *=i  ) 

Setting the ideal situation of $E_j$ equal to 0, the following equation is used

$$E_j = \left(\sum_{k=1}^{\ell} \alpha_k y_k K(x_k, x_j) + b\right) - y_j = 0 \qquad (3.24)$$

where $j = 1, \ldots, \ell$. Using the absolute values of $y_k$ and $y_j$ results in the following equation:

+  =  £  U  (3.25) 

J'= 1  V 

which is similar to Equation (3.23). When a feature is removed, a larger ranking indicates a prediction farther away from the true class, $y_j$. The $R_m$ criterion allows a view of how well the SVM model separates the space in the absence of the removed feature.

Figures 3.10 and 3.11 show the values of the decision function $f(x_j^{(m)})$ when the highest and lowest ranked features are removed. In Figures 3.10 and 3.11 the x-axis represents the index of sample $x_j^{(m)}$ and the y-axis represents the value of the sample after calculating Equation (3.21). In Figure 3.10 the top ranked feature is removed, showing the space between 0 and 6. When Equation (3.21) is calculated this results in a large ranking value.

Figure 3.10. One dimensional mapping of Equation (3.20) when the highest ranked feature is removed.

In Figure 3.11 the lowest ranked feature is removed, showing the space converging on 1. This will result in a ranking value approximately equal to $|y_j|$, indicating a low ranking.

Figure 3.11. One dimensional mapping of Equation (3.20) when the lowest ranked feature is removed.

While the space is not perfectly separated in Figures 3.10 and 3.11, the reader should be aware that the figures do not show a mapping of $\alpha_k^{(m)} K^{(m)}(x_k^{(m)}, x_j^{(m)}) + b$ with the top ranked features. The two figures are shown to give insight into the effect a removed feature has on the mapping from the input space to the ranking space using Equation (3.21).

In Figure 3.12 the top 25% of the ranked features are kept. In this simple example Figure 3.12 shows that, by maintaining the top ranked features, the error function in Equation (3.22) can be trained to zero.

Figure 3.12. One dimensional mapping of Equation (3.19) when the top 25% of the ranked features are kept.

This method takes advantage of the classification decision function in Equation (2.71). The simplicity of this method makes it suitable for inclusion in most kernel based classifiers with a decision function similar to Equation (2.71). In the next subsection this ranking method is applied to the kernel Fisher’s discriminant classifier.

3.2.2  Kernel  Fisher’s  Discriminant  Classifier  Kernel  Feature  Ranking  (KF-KFR) 

The  same  application  in  Section  3.2  can  be  extended  to  ranking  features  for  the  KFD 
classifier.  The  first  step  is  to  calculate  the  initial  alpha  vectors  as  follows: 


a 


(3.26)
where


and 


/ 


i  j= i 

yeC_| 


/ 


+i  y=i 
i£C+i 


(3.27) 


(3.28) 


where $C = \{C_{-1}, C_{+1}\} = \{-1, +1\}$. Mika et al. (1999) discuss numerical issues and regularization regarding the calculation of Equation (3.28). These are resolved by adding a multiple of the identity matrix to N, defined as:

$$N_\mu^{(m)} = N^{(m)} + \mu I \qquad (3.29)$$

The next step is to use the alpha vectors and the kernel matrix to project the (n-1)-dimensional input feature space into a one dimensional space as follows:

$$x = K^{(m)}(x_i, x_j)\,\alpha^{(m)}. \qquad (3.30)$$

The projection in Equation (3.30) now becomes the space that is to be solved using an optimization solution. Mika et al. (1999) use the Matlab Optimization Toolbox (Matlab, 2007) to solve the optimization problem with the projected space calculated in Equation (3.30). For the interested reader the optimization problem is described in detail on pp. 460-462 of (Scholkopf and Smola, 2002). In this paper the one dimensional SMO (Franc and Hlavac, 2007) is used as the optimization solution. This results in the non-negative alpha vectors $\alpha = (\alpha_1, \ldots, \alpha_\ell)$ with an upper bound C, $C \ge \alpha_i \ge 0$. The support vectors for the KFD trained model are $x_k = x_i$ and the decision function of the KFD classifier is written as sign(f(x)) where f(x) is defined by:

$$f(x) = w \cdot \phi(x) + b = \sum_{i=1}^{\ell} \alpha_i K(x_i, x) + b. \qquad (3.31)$$

The bias b is calculated by obtaining the average as in Equation (2.65). The final step is to rewrite Equation (3.21) to calculate the ranking values as follows:

$$R_m = \sum_{j=1}^{\ell}\left(\sum_{i=1}^{\ell} \alpha_i^{(m)} K^{(m)}\left(x_i^{(m)}, x_j^{(m)}\right) + b\right) \qquad (3.32)$$

The  algorithm  for  the  kernel  Fisher’s  feature  ranking  method  is  as  follows: 

1. For each of the n features perform steps 2 and 3.

2. Remove the current feature m from the data set $x_i$ and train the KFD model using Equations (3.26) through (3.30) to obtain the alpha vectors, support vectors and bias.

3. Assign a ranking value $R_m$ according to Equation (3.32) and restore feature m to the data set.

4. After completion of the loop, sort the ranking values $R_m$ in descending order.

5. Select the r highest ranked features for training the final KFD classification model.

The procedure is conducted for each feature, and the features are ranked in descending order where the largest value corresponds to the feature of most importance. It should be noted that the calculation of the alpha weights in Equation (3.26) is an important step when ranking the features.

3.3 Learning Decision Trees using Kernel mapping for creating Multi-class Classification from two-class KFD and SVM Classifiers

In this section a multi-class tree structure for performing multi-class classification with two-class KFD and SVM classifiers is described. The structure and learning of the tree is known as a learning decision tree (Russell and Norvig, 2003). When designing the structure of the tree, a distance measure in the kernel space is calculated at each node between three or more classes. A branch connects two nodes within the tree. Branches are added from each node, known as a parent node, so long as more than one class remains. A leaf node, that is, a node with no successor in the tree, specifies the class value when a single class is reached. The depth of the tree is determined by the number of nodes along a path from the top parent node to a leaf node. For example, Figure 3.13 shows the tree structure for a ten-class problem where the labels represent the individual classes, 1 = Clean, 2 = F5, 3 = JP Hide, 4 = JSteg, 5 = Model-based, 6 = Model-based Ver. 1.2, 7 = OutGuess, 8 = Steganos, 9 = StegHide, 10 = UTSA.

Figure 3.13. Decision tree for a 10-class classification problem with 10 leaf nodes, 9 parent nodes and a maximum depth of 6.

The top node labeled [1 2 3 4 5 6 7 8 9 10] is at the first level of the tree and is the parent node of the nodes labeled [1 2 3 4 5 6 8] and [7 9 10]. The leaf nodes from left to right in this tree are labeled [1], [2], [3], [4], [5], [6], [8], [7], [9] and [10]. This tree has a maximum depth of 6, which is the path from the parent node [1 2 3 4 5 6 7 8 9 10] to [5] or [6].


For this problem there are several steps in learning the tree. The first step is to map the input training set $x_i = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^n$, $i = 1, \ldots, \ell$, through $\phi(x_i): X \to F$ from input space into a potentially higher dimensional space $F \in \mathbb{R}^{\ell}$ called kernel space. The mapping $\phi(x_i)$ is represented by a kernel function $K(x_i, x_j)$ that defines an inner product in F. Each sample in the training set contains one target value $y_i \in C = [C_1, C_2, \ldots, C_c]$, $i = 1, 2, \ldots, \ell$, which describes the class of which the sample is a member. The parameters for calculating the kernel matrix are important when training the tree and the two-class classifiers at each node. The kernels used in this research are as follows

1. linear: $K(x_i, x_j) = x_i^T x_j$

2. polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$

3. radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$

4. sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$

where γ, r, and d are kernel parameters.
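The kernel functions can be sketched in a few lines of Python. The function names and default parameter values are assumptions made for this illustration; the linear kernel is included as the simplest member of the family.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(u, v):
    return dot(u, v)

def polynomial(u, v, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(u, v) + r) ** d

def rbf(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid(u, v, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(u, v) + r)

def kernel_matrix(X, kernel, **params):
    """Evaluate the chosen kernel on every pair of samples in X."""
    return [[kernel(xi, xj, **params) for xj in X] for xi in X]
```

For example, `kernel_matrix(X, rbf, gamma=0.5)` returns the symmetric matrix of pairwise RBF values for the samples in `X`.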

The distance measure used in this section is an expansion of the KFD (Mika et al., 1999). The second step is to calculate the initial alpha vectors for a multi-class problem. The alpha vectors are defined as follows:


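$$\hat{\alpha} = \frac{M}{N_\mu} \qquad (3.33)$$

where

$$M = \frac{1}{[c(c-1)/2]}\sum_{p=1}^{c-1}\sum_{q=2}^{c}\left(M_{C_p} - M_{C_q}\right), \qquad M_{C_t} = \frac{1}{\ell_{C_t}}\sum_{\substack{j=1 \\ j \in C_t}}^{\ell} K(x_i, x_j) \qquad (3.34)$$

and

$$N_\mu = N + \mu I, \qquad N = \sum_{j \in C} K_j\left(I - 1_{\ell_j}\right)K_j^{T} \qquad (3.35)$$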

The regularization value μ must be large enough so that $N_\mu$ is positive definite (Mika et al., 1999). The next step is to use the alpha vectors and the kernel matrix to project the input feature space into a one dimensional space as follows:

$$x = K(x_i, x_j)\,\hat{\alpha} \qquad (3.36)$$

where x is an [ℓ × 1] vector. Now the individual class distance can be calculated as

$$D_{C_k} = \frac{1}{\ell_{C_k}}\sum_{\substack{i=1 \\ i \in C_k}}^{\ell} x_i \qquad (3.37)$$

The distance vector $D_C$ is of length c. Once the distance vectors are calculated, the next step is to take the average of $D_C$, which provides a separation point between classes when the number of classes is greater than two. For example, the top node in Figure 3.13 contains classes [1 2 ... 10] and the classes are divided into two branches. The left branch contains classes [1 2 3 4 5 6 8] while the right branch contains classes [7 9 10]. Figure 3.14 corresponds to Figure 3.13 and contains the 10 classes totaling 1000 samples, with the samples on the x-axis and the sample values on the y-axis. The distance values $D_C$ are shown within the figure as well. The average of $D_{C_k}$ is -5.0313, which is the value used to separate the ten classes into two sub classes.

Figure 3.14. Distance values $D_C$ for a 10 class problem.

Once a new branch with more than two classes is built the distance measure is calculated again. The nodes of the tree are expanded from left to right until a leaf node is reached. Consider the node labeled [2 3]: this node will contain two leaf nodes labeled [2] and [3].
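The class distance and the averaging step can be illustrated with a short sketch. It assumes the one dimensional projected values are already available as a list, and it assumes that classes whose distance is at or below the average go to the left branch; the text does not state which side is which, and the function names are illustrative.

```python
def class_distances(projected, labels):
    """Average projected value per class (the per-class distance)."""
    sums, counts = {}, {}
    for value, label in zip(projected, labels):
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def split_classes(distances):
    """Split the classes at the average of the class distances."""
    threshold = sum(distances.values()) / len(distances)
    # Assumption: distances at or below the threshold form the left branch.
    left = sorted(c for c, d in distances.items() if d <= threshold)
    right = sorted(c for c, d in distances.items() if d > threshold)
    return threshold, left, right
```

With three classes whose projected values average 0, 1 and 10, the threshold is 11/3 and the split is [1, 2] against [3].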

The decision tree learning algorithm is as follows:

1. Input training data $x_i$ with class labels, the kernel parameters and the classifier (KFD or SVM).

2. If $x_i$ is empty, return.

3. Else if the class labels of $x_i$ are all the same, make a leaf node and return.

4. Else if $x_i$ contains two classes, make two leaf nodes, a left and a right, and return.

5. Else if $x_i$ contains three or more classes, calculate the average distance for each class.

6. Divide the input data into two classes creating two branches, a left and a right.

i. If the left branch contains more than two classes go to step 1.

ii. Else make a leaf node and go to step iii.

iii. If the right branch contains more than two classes go to step 1.

iv. Else make a leaf node and return.

7. Return tree.

8. Train the two-class classifiers for each node of the tree.
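A recursive sketch of the tree construction is given below. It is an illustration under assumptions rather than the implementation used in this research: `distance_fn` is a hypothetical callable that returns the class distance for each class at the current node, a branch with exactly two classes is turned into two leaves as in step 4, and an even split is used as a fallback when all distances are equal.

```python
def build_tree(classes, distance_fn):
    """Return a nested tuple of class labels; a bare label is a leaf."""
    classes = sorted(classes)
    if not classes:
        return None
    if len(classes) == 1:
        return classes[0]                      # single class: leaf node
    if len(classes) == 2:
        return (classes[0], classes[1])        # two classes: two leaves
    dist = distance_fn(classes)                # dict: class -> distance
    threshold = sum(dist[c] for c in classes) / len(classes)
    left = [c for c in classes if dist[c] <= threshold]
    right = [c for c in classes if dist[c] > threshold]
    if not right:                              # all distances equal
        mid = len(classes) // 2
        left, right = classes[:mid], classes[mid:]
    return (build_tree(left, distance_fn), build_tree(right, distance_fn))

def leaves(tree):
    """List the leaf labels from left to right."""
    if isinstance(tree, tuple):
        return [c for branch in tree for c in leaves(branch)]
    return [tree]
```

Training the two-class classifier at each parent node (step 8) is left out; each tuple in the result marks one node that would need such a classifier.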

3.4  Fusion  of  Multi-Class  Classification  Systems 

In this section the fusion methods of the multi-class detection systems are covered. The class labels of the 8 multi-class detection systems are fused. In this research there are 10 image classes, consisting of clean, F5 (Westfeld, 2001; 2003), JP Hide (Latham, 1999), JSteg (Upham, 1993), Model-based (Sallee, 2003; 2006), Model-based Version 1.2 (Sallee, 2008a), OutGuess (Provos, 2004), Steganos (2008), StegHide (Hetzl, 2003) and UTSA (Agaian et al., 2006). The three fusion methods, AdaBoost (Bishop, 2006, p. 658), Bayesian Belief Networks (Murphy, 2001) and Probabilistic Neural Networks (Leap et al., 2007), used in this section were described in Chapter 2, Section 2.7.
3.4.1 AdaBoost Boosting


In this subsection the 7 detection systems are fused using AdaBoost (Rodriguez and Peterson, 2008b). Each classification model is defined as $M_k$. The input training set is $x_i = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^n$, where each sample in the training set has one target value from $C = [C_1, C_2, \ldots, C_c]$, $k = 1, 2, \ldots, c$ (known as the class labels $y_i \in C$, $i = 1, 2, \ldots, \ell$). The method implemented in this research is from Bishop (2006, p. 658). The method described by Bishop (2006) has three steps as follows:

1. The data weighting coefficients $\{w_i\}$ are initialized as $w_i^{(1)} = 1/\ell$ for $i = 1, \ldots, \ell$.

2. For k = 1, ..., 7:

(a) Fit a classifier $M_k(x)$ to the training data by minimizing the weighted error function

$$J_k = \sum_{i=1}^{\ell} w_i^{(k)} I\left(M_k(x_i) \ne y_i\right)$$

where $I(M_k(x_i) \ne y_i)$ is the indicator function and equals 1 when $M_k(x_i) \ne y_i$ and 0 otherwise.

(b) Evaluate the quantities

$$\epsilon_k = \frac{\sum_{i=1}^{\ell} w_i^{(k)} I\left(M_k(x_i) \ne y_i\right)}{\sum_{i=1}^{\ell} w_i^{(k)}}$$

and then use these to evaluate

$$\alpha_k = \ln\left\{\frac{1 - \epsilon_k}{\epsilon_k}\right\}$$

(c) Update the data weighting coefficients

$$w_i^{(k+1)} = w_i^{(k)} \exp\left\{\alpha_k I\left(M_k(x_i) \ne y_i\right)\right\}$$

3. Making a prediction using the final trained model for an input image sample $x_0 \in \{C, F5, JPH, JS, MB, MB1.2, OG, STN, SH, UTSA\}$ is given by

$$f(x_0) = \sum_{k=1}^{7} \alpha_k M_k(x_0)$$
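The weighting steps can be sketched as follows. This is a simplified illustration, not the implementation used in this research: it assumes the detection systems are already trained, so only the weights and the coefficients $\alpha_k$ are computed from fixed predictions, and it assumes the final prediction for class labels is taken as an $\alpha$-weighted vote. The function names and the clamping of $\epsilon_k$ are illustrative choices.

```python
import math

def adaboost_alphas(predictions, labels):
    """predictions[k][i] is the label classifier k assigns to sample i."""
    n = len(labels)
    w = [1.0 / n] * n                           # step 1: uniform weights
    alphas = []
    for preds in predictions:                   # step 2: one pass per model
        miss = [1 if p != y else 0 for p, y in zip(preds, labels)]
        eps = sum(wi * mi for wi, mi in zip(w, miss)) / sum(w)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = math.log((1.0 - eps) / eps)
        alphas.append(alpha)
        # Misclassified samples gain weight for the next model.
        w = [wi * math.exp(alpha * mi) for wi, mi in zip(w, miss)]
    return alphas

def fuse(alphas, votes):
    """Assumed fusion rule: the class with the largest summed alpha wins."""
    score = {}
    for alpha, vote in zip(alphas, votes):
        score[vote] = score.get(vote, 0.0) + alpha
    return max(score, key=score.get)
```

A model that errs on a quarter of the weighted samples receives $\alpha = \ln 3$, so models with lower weighted error carry more weight in the vote.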


3.4.2  Bayes  Network  for  Model  Averaging 

In this subsection the 7 detection systems are fused using a Bayesian network (Rodriguez et al., 2008b). Each classification model is defined as $M_k$ as shown in Figure 3.15.

[Figure 3.15: Input Image ($x_0$)]

Table 3.2 shows the prior probabilities that a target T is a clean (C), F5, JP Hide (JPH), JSteg (JS), Model-based (MB), Model-based Version 1.2 (MB1.2), OutGuess (OG), Steganos (STN), StegHide (SH) and UTSA (UTSA) image.


Table 3.2. Distribution of the image types.

Target (T)   T = C   T = F5   T = JPH   T = JS   T = MB   T = MB1.2   T = OG   T = STN   T = SH   T = UTSA
             0.1     0.1      0.1       0.1      0.1      0.1         0.1      0.1       0.1      0.1

For example, an input image sample $x_0 \in \{C, F5, JPH, JS, MB, MB1.2, OG, STN, SH, UTSA\}$, as shown in Figure 3.15, fed into each of the trained classification detection systems will have a class label assigned by each of the systems. So, to determine the probability that the class label is C when each of the models returns the class label C, the model averaging topology dictates a joint pdf as

$$P(M_1|T)\,P(M_2|T)\,P(M_3|T)\,P(M_4|T)\,P(M_5|T)\,P(M_6|T)\,P(M_7|T)\,P(T)$$

The method used to facilitate the computations in the model averaging is Murphy’s (2001) Bayes Net Toolbox (BNT) for Matlab, resulting in the following calculations.

$$P(T = C \mid M_1 = \text{"C"}, M_2 = \text{"C"}, \ldots, M_7 = \text{"C"})$$

$$= \frac{P(T = C, M_1 = \text{"C"}, M_2 = \text{"C"}, \ldots, M_7 = \text{"C"})}{P(M_1 = \text{"C"}, M_2 = \text{"C"}, \ldots, M_7 = \text{"C"})}$$

$$= \frac{P(T = C, M_1 = \text{"C"}, M_2 = \text{"C"}, \ldots, M_7 = \text{"C"})}{\sum_{x_0} P(T = x_0, M_1 = \text{"C"}, M_2 = \text{"C"}, \ldots, M_7 = \text{"C"})}$$

Using Bayes’ Rule the numerator can be represented as

$$P(M_1|C)\,P(M_2|C) \cdots P(M_7|C)\,P(T) = P(M_1 = \text{"C"} \mid T = C)\,P(M_2 = \text{"C"} \mid T = C) \cdots P(M_7 = \text{"C"} \mid T = C)\,P(T = C)$$

and the denominator as

$$\sum_{x_0} P(T = x_0, M_1 = \text{"C"}, \ldots, M_7 = \text{"C"}) = \sum_{x_0} P(M_1 = \text{"C"} \mid T = x_0)\,P(M_2 = \text{"C"} \mid T = x_0) \cdots P(M_7 = \text{"C"} \mid T = x_0)\,P(T = x_0)$$
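The posterior above can be computed directly once the conditional tables are known. The sketch below is plain Python rather than the Bayes Net Toolbox, and the data layout (a prior dictionary and one conditional table per model) is an assumption made for illustration.

```python
def posterior(prior, likelihoods, observed):
    """Posterior P(T | M_1, ..., M_K) under the model averaging topology.

    prior:       dict mapping class t -> P(T = t)
    likelihoods: one dict per model, lik[t][label] = P(M_k = label | T = t)
    observed:    the label returned by each model, in the same order
    """
    joint = {}
    for t, p in prior.items():
        for lik, label in zip(likelihoods, observed):
            p *= lik[t][label]                 # numerator: product of terms
        joint[t] = p
    total = sum(joint.values())                # denominator: sum over classes
    return {t: p / total for t, p in joint.items()}
```

As a small two-class example, with a uniform prior and two models that each report the true class with probability 0.8, two votes for "C" give a posterior of 0.32 / 0.34 for the clean class. The function assumes at least one class has non-zero joint probability.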
3.4.3 Probabilistic Neural Network (PNN) Fusion

In this method the outputs of the individual classification models are treated as input features to train the PNN fusion system. The key is to use the class labels from each of the systems as posterior probability estimates and to employ them as features in the neural network. It should be noted that one of the posterior probabilities from each input classifier should be removed. For the seven individual ten-class classifiers used in this research, each of the classification models, $M_k$, therefore contributes nine inputs for training the PNN.
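One way to picture the PNN inputs is an indicator encoding of each model's class label with one class dropped. The sketch below is an assumption about the encoding, not the implementation used in this research; in particular, dropping the last class in the list is an arbitrary illustrative choice.

```python
CLASSES = ["C", "F5", "JPH", "JS", "MB", "MB1.2", "OG", "STN", "SH", "UTSA"]

def encode(labels, classes=CLASSES):
    """Turn the labels from the models into one flat feature vector."""
    kept = classes[:-1]            # assumption: drop the last class
    features = []
    for label in labels:
        features.extend(1.0 if label == c else 0.0 for c in kept)
    return features
```

Seven ten-class models then yield 7 × 9 = 63 inputs per image. A model that returns the dropped class contributes nine zeros.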


[Figure 3.16 blocks: Input Training Image; 7 10-class Detection Models; 9 Inputs Each From the 7 Models; Output Probability of Belonging to ...]

Figure 3.16. Probabilistic Neural Network Classification Structure.

3.5 Summary


This chapter presented three new methods for improving multi-class detection systems for the kernel Fisher’s discriminant and support vector machines. The first method used in the system is the generation of features using the DCT for JPEG images. The major components of the new feature generation method are the decomposition of the DCT coefficients and the use of four different predictors. The second new method is a feature ranking method which uses the individual classifiers to rank the features by class separability in the kernel space. The final method consists of a multi-class tree which is expanded with the use of a distance measure between classes in the kernel space. In addition to the three new methods used in the development of multi-class classification for KFD and SVM, multiple steganalysis systems are fused. The fusion techniques used are based on modified implementations of AdaBoost (Bishop, 2006), Bayesian networks (Murphy, 2001) and probabilistic neural networks (Leap et al., 2007).

Chapter 4 demonstrates results with an increase in classifier performance. The results shown compare an existing multi-class SVM classifier with the new methods shown in this chapter: feature selection, the multi-class classifier and a modified simple fusion method.

IV. Analysis and Results


The goal of the steganalysis classification system is to identify an input JPEG image as a clean image or identify the embedding algorithm used. The nine embedding algorithms tested are F5 (Westfeld, 2001; 2003), JP Hide (Latham, 1999), JSteg (Upham, 1993), Model-based (Sallee, 2003; 2006), Model-based Version 1.2 (Sallee, 2008a), OutGuess (Provos, 2004), Steganos (2008), StegHide (Hetzl, 2003) and UTSA (Agaian et al., 2006). This chapter compares the performance of the KFD and SVM multi-class system developed against four multi-class (i.e., EM, k-NN, Parzen window and PNN) and three fusion (i.e., AdaBoost, Bayes and PNN fusion) classification techniques. In order to statistically compare the systems, k-fold cross validation is used for both training and testing the system with a clean JPEG image dataset and nine stego image datasets. The statistical tool applied for analysis is the two-tailed Student’s t-test.

The clean JPEG image dataset used as a cover image set for analyzing the system includes 1000 RGB images of size 512x512 with a quality factor of 75%. Nine stego image datasets are generated from the clean dataset by embedding, with each of the aforementioned nine embedding tools, a stego message of 4000 characters, which is equivalent to one page of text. The number of DCT coefficients altered within a color layer of a JPEG image is known as the embedding rate (Kharrazi et al., 2005). The average embedding rate of the coefficients altered for each stego image dataset is as follows:

• F5 has an average embedding rate of 6.25%.

• JP Hide (JPH) has an average embedding rate of 3.76%.

• JSteg (JS) has an average embedding rate of 7.53%.

• Model-based (MB) has an average embedding rate of 5.36%.

• Model-based Version 1.2 (MB1.2) has an average embedding rate of 5.68%.

• OutGuess (OG) has an average embedding rate of 3.24%.

• Steganos (STN) has an average embedding rate of 0.75%.

• StegHide (SH) has an average embedding rate of 2.30%.

• UTSA has an average embedding rate of 5.38%.

Note that in testing and training, 100 images are chosen from each clean and stego image dataset. The clean images used within the clean image dataset do not appear as stego images within the stego image datasets, nor does any stego image reappear embedded with another steganography algorithm. For example, none of the F5 images were the same as the JSteg images.

This chapter demonstrates the performance of the steganalysis classification system developed in this research. Section 4.1 describes the statistical methods of measure used for testing and validation in the experiment. The results include a comparison of the feature generation methods (Section 4.2): wavelet feature generation, DCT feature generation and DCT directional and frequency decomposition feature generation. In Section 4.3, results on the steganalysis dataset for eight multi-class classification methods including expectation maximization with mixture models (EM), k-nearest neighbors (k-NN), kernel Fisher’s discriminant (KFD), Parzen window, probabilistic neural networks (PNN), support vector machines (SVM) and StegoWatch are discussed. Section 4.4 demonstrates a performance improvement when fusing several classification algorithms together. Experimental results of three fusion techniques using AdaBoost, a Bayesian network, and a probabilistic neural network are shown. Finally, a summary of all the results is presented in Section 4.5.

4.1 Confirming and Validating the Analysis

In statistics a result is statistically significant if it is unlikely to have occurred by chance. A statistically significant difference between two sets of results simply implies that there is statistical evidence that there is a difference. This, however, does not indicate that the difference is necessarily large. In this research the results are generated using k-fold cross validation to determine the classification accuracy of the classification models. A t-test between paired samples about the means with a confidence level of 95% is used to determine the statistical significance of the results.

In k-fold cross-validation, the original sample is partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the test data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds are averaged to produce a single estimation (Kohavi, 1995; Mitchell, 1997; Russell and Norvig, 2003).

In  this  chapter  the  data  is  partitioned  into  five  groups  of  equal  size  as  shown  in  Figure 
4.1.  For  each  run  four  of  the  groups  are  used  for  training  the  classification  model  and  the 
remaining  group  is  used  for  testing  the  model.  This  procedure  is  repeated  for  five  runs 
where  the  runs  are  for  all  five  possible  choices  of  the  held  out  test  group. 


[Figure 4.1 diagram: the total number of samples is divided into five equal groups. In
Run 1 the first group is the testing data and the remaining groups are training data; in
Runs 2 through 5 the testing group moves one position further along in each run, with all
other groups used as training data.]

Figure  4.1.  5-fold  cross-validation  with  5  runs  consisting  of  80%  of  the  data  for  training 
the  classification  model  and  20%  for  testing  the  training  model. 
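
The partitioning in Figure 4.1 can be illustrated with a short sketch. It assumes the folds are contiguous blocks of sample indices taken in order; the text does not say how samples were assigned to groups, so the function name `k_fold_splits` and the sample count of 100 are illustrative choices only.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k runs."""
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for run in range(k):
        # Spread any remainder over the first folds so sizes differ by at most one.
        stop = start + fold_size + (1 if run < remainder else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Hypothetical dataset of 100 samples: each run trains on 80% and tests on 20%.
splits = list(k_fold_splits(100, k=5))
for run, (train, test) in enumerate(splits, start=1):
    print(f"Run {run}: {len(train)} training samples, {len(test)} testing samples")
```

Each sample lands in the testing group exactly once across the five runs, which is what allows the five accuracies to be averaged into a single estimate.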


To ensure that the test of significance is calculated properly, the Lilliefors test for
normality is used to determine if the results being analyzed are normally distributed
(Lilliefors, 1967; Abdi and Molin, 2007). If the results are determined to be normally
distributed, the t-test is used to test for statistical significance (Hogg and Tanis,
1993; Kohavi, 1995; Rice, 1995; Wackerly et al., 1996). On the other hand, if the test for
normality fails, then the Wilcoxon test is used to determine if the results are significant.

In the next section the results are shown in tables using the 5-fold cross-validation. The
tables are accompanied by analysis to determine if the reported results are statistically
significant.
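
As a minimal sketch of the significance decision, the paired t statistic for two sets of five per-fold accuracies can be computed and compared against the two-tailed critical value 2.776 (4 degrees of freedom, α = 0.05) used in the tables of this chapter. The per-fold accuracies below are hypothetical, the function name is illustrative, and the Lilliefors normality check and the Wilcoxon fallback are not shown.

```python
import math
import statistics

T_CRITICAL = 2.776  # two-tailed critical value, 4 degrees of freedom, alpha = 0.05


def paired_t_statistic(a, b):
    """t statistic for the mean of the paired differences b[i] - a[i]."""
    diffs = [y - x for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd_d == 0.0:
        # All differences identical: report 0.0 when there is no difference at all.
        return 0.0 if mean_d == 0.0 else math.copysign(math.inf, mean_d)
    return mean_d / (sd_d / math.sqrt(len(diffs)))


# Hypothetical per-fold classification accuracies (%), not taken from the tables.
without_ranking = [62.0, 66.0, 64.0, 70.0, 66.0]
with_ranking = [74.0, 77.0, 75.0, 80.0, 76.0]

t = paired_t_statistic(without_ranking, with_ranking)
significant = abs(t) > T_CRITICAL
print(f"t = {t:.2f}, statistically significant: {significant}")
```

Returning 0.0 when both sets are identical is a convention chosen here so that two sets of perfect accuracies compare as "not significant" instead of raising a division error.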

4.2  Feature  Generation  Method  Comparison 

The  results  in  this  section  show  a  comparison  between  the  three  feature  generation 
methods  of  wavelet  features,  DCT  features,  and  DCT  directional  and  frequency 
decomposition  features,  and  results  that  use  all  three  feature  generation  methods 
combined.  Prior  to  classification  the  data  is  prepared  using  the  data  standardization 
described  in  subsection  2.3.1  (Dillon  and  Goldstein,  1984,  pp.  12-13).  Feature 
discrimination  capability  results  from  executing  a  SVM  two-class  classifier  without  and 
with  the  SVM-kernel  feature  ranking  described  in  subsection  3.2.1.  The  SVM  method 
used is SVM light (Joachims, 1998, 2007). The feature ranking method used is the
SVM-kernel feature ranking method presented in Section 3.2. The kernel function
$K(x_i, x_j)$ used is the radial basis function $e^{-\gamma \|x_i - x_j\|^2}$ with the
parameter $\gamma = 1/(2(12)^2)$ and the upper bound C = 12. The results of the analysis
include the
percentage  of  true  positive  and  true  negatives  shown  on  a  class-by-class  basis  where  the 
clean  image  sets  are  compared  against  each  steganography  embedding  image  set.  The 
true  negative  indicates  the  percentage  of  clean  images  correctly  classified  as  clean  images, 
while  the  true  positive  indicates  the  percentage  of  stego  images  correctly  classified  as 
stego  images.  The  average  of  true  negative  and  true  positive  is  the  classification  accuracy 
(CA). 
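
The three reported quantities can be sketched as follows for a single clean-versus-stego test fold. The labels and predictions are hypothetical and the function name `rates` is an illustrative choice; only the definitions (true negative, true positive, and their average as CA) come from the text above.

```python
def rates(actual, predicted, positive="stego"):
    """Return (true negative %, true positive %, classification accuracy %)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    n_pos = sum(1 for a in actual if a == positive)
    n_neg = len(actual) - n_pos
    tp_rate = 100.0 * tp / n_pos  # stego images correctly classified as stego
    tn_rate = 100.0 * tn / n_neg  # clean images correctly classified as clean
    return tn_rate, tp_rate, (tn_rate + tp_rate) / 2.0


# Hypothetical fold with four clean and four stego images.
actual = ["clean"] * 4 + ["stego"] * 4
predicted = ["clean", "clean", "clean", "stego", "stego", "stego", "clean", "clean"]

tn_rate, tp_rate, ca = rates(actual, predicted)
print(f"TN = {tn_rate}%, TP = {tp_rate}%, CA = {ca}%")
```

Because CA averages the two rates, it is not skewed when the clean and stego sets differ in size.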
4.2.1  Wavelet Feature Generation (Lyu and Farid, 2004)


The  results  for  the  wavelet  feature  generation,  which  generates  72  features,  are  shown 
without  and  with  feature  ranking  in  Tables  4.1  and  4.2,  respectively. 


Table 4.1. Classification accuracy for wavelet feature generation.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
True Negative              64.8±5.0   94.3±9.8   98.1±2.6   59.4±13.6  59.5±10.0  71.3±8.7   74.8±11.9  50.7±6.1   80.7±8.5
True Positive              66.6±.34   81.6±6.3   98.1±2.6   56.3±9.7   56.9±7.5   70.7±9.2   68.5±6.7   50.2±7.0   78.4±7.6
Classification Accuracy    65.7±3.9   87.9±5.1   98.1±1.0   57.8±11.6  58.2±8.7   71.0±8.9   71.6±8.0   50.4±6.5   79.5±6.7

Table 4.2. Classification accuracy for wavelet feature generation using feature ranking.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
No. of Features            25         25         15         19         20         22         16         12         39
True Negative              74.9±4.4   99.0±2.1   99.1±2.1   71.6±6.0   73.1±6.9   74.1±7.2   83.8±2.8   64.6±4.5   86.4±4.0
True Positive              78.0±4.4   91.6±8.6   98.1±2.6   66.4±3.4   69.6±5.1   74.0±6.5   72.6±3.4   61.4±2.6   82.9±3.5
Classification Accuracy    76.4±2.7   95.3±3.8   98.5±1.3   69.0±4.1   71.3±5.5   74.0±6.7   78.2±2.4   63.0±2.6   84.6±3.2

The results shown in Table 4.2 indicate an improvement of detection accuracy by proper
selection of features during training. The second row shows the number of features
among 72 identified by the SVM-kernel feature ranking method. The statistical
significance of selecting features with the proposed feature saliency metric is depicted in
Table 4.3. As can be seen in the significance testing for classification accuracy, the Clean
vs. F5, Clean vs. MB 1.2, and Clean vs. SH comparisons show a statistically significant
difference in the mean, while the difference in the mean for the other embedding methods
is not statistically significant.


Table 4.3. t-test; paired two samples for means between Tables 4.1 and 4.2.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     5.97       2.39       0.186      2.42       5.03       0.86       1.68       4.09       1.86
Statistically Significant  Yes        No         No         No         Yes        No         No         Yes        No

4.2.2  DCT  Feature  Generation  (Pevny  and  Fridrich,  2006) 


The  results  for  the  DCT  feature  generation  (Pevny  and  Fridrich,  2006)  are  shown  without 
and  with  feature  selection  in  Tables  4.4  and  4.5.  The  274  features  generated  (Pevny  and 
Fridrich,  2006)  are  an  extension  of  the  original  features  described  in  Section  2.2.2 
developed  by  Fridrich  (2004). 


Table 4.4. Classification accuracy for DCT feature generation.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
True Negative              100±0.0    100±0.0    100±0.0    99.0±2.1   100±0.0    100±0.0    86.5±6.9   100±0.0    100±0.0
True Positive              100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    87.8±6.0   100±0.0    100±0.0
Classification Accuracy    100±0.0    100±0.0    100±0.0    99.5±1.1   100±0.0    100±0.0    87.1±6.0   100±0.0    100±0.0

Table 4.5. Classification accuracy for DCT feature generation using feature ranking.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
No. of Features            12         24         5          7          7          5          23         5          5
True Negative              100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    89.1±3.6   100±0.0    100±0.0
True Positive              100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    88.5±5.7   100±0.0    100±0.0
Classification Accuracy    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    88.7±2.8   100±0.0    100±0.0

The results shown in Table 4.5 indicate that, after the SVM-kernel feature ranking, only a
few of the 274 features are necessary for perfect classification accuracy in every case
except the Clean vs. STN image classes. The statistical significance of selecting features
with the proposed feature ranking is depicted in Table 4.6. As can be seen in the
significance testing for classification accuracy, only the Clean vs. STN image classes
show a significant difference in the mean, while the difference in the mean for the other
stego embedding methods is not statistically significant. Although there is no
improvement in the classification accuracy from including a feature ranking method (it is
difficult to improve on a perfect classification), the utility is apparent in the reduced
number of features necessary to still achieve perfect classification.


Table 4.6. t-test; paired two samples for means between Tables 4.4 and 4.5.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     0.0        0.0        0.0        1.00       0.0        0.0        0.43       0.0        0.0
Statistically Significant  No         No         No         No         No         No         Yes        No         No

4.2.3  DCT Directional and Frequency Decomposition


The  results  for  the  DCT  directional  and  frequency  decomposition  feature  generation 
described  in  Section  3.1  are  shown  without  feature  selection  in  Table  4.7  and  with  feature 
selection  in  Table  4.8.  This  feature  generation  method  results  in  180  features. 


Table 4.7. Classification accuracy for DCT directional and frequency feature generation.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
True Negative              95.4±5.3   99.0±2.1   99.0±2.1   99.0±2.1   96.4±5.7   94.5±5.8   96.2±6.0   98.0±2.8   100±0.0
True Positive              95.4±5.3   100±0.0    98.2±4.1   93.7±4.9   93.8±6.6   97.0±2.7   89.2±4.2   92.9±6.3   100±0.0
Classification Accuracy    95.4±2.1   99.5±1.1   98.6±2.0   96.3±1.8   95.1±2.3   95.7±2.5   92.7±1.9   95.4±3.0   100±0.0

Table 4.8. Classification accuracy for DCT directional and frequency feature generation
using feature ranking.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
No. of Features            21         35         22         26         27         24         23         25         22
True Negative              98.2±4.1   100±0.0    100±0.0    98.2±4.1   98.2±4.1   98.1±2.6   100.0±0.0  100±0.0    100±0.0
True Positive              100±0.0    100±0.0    100±0.0    98.1±2.6   98.1±2.6   98.1±2.6   97.1±2.6   96.3±3.8   100±0.0
Classification Accuracy    99.1±2.0   100±0.0    100±0.0    98.1±1.9   98.1±1.9   98.1±1.1   98.5±1.3   98.1±1.9   100±0.0

The results shown in Table 4.8 indicate an improvement of detection accuracy by proper
ranking of features during training. The second row shows the number of features among
180 identified by the presented feature saliency metric, i.e., the SVM-kernel feature
ranking method. The statistical significance of selecting features with the proposed
feature ranking is depicted in Table 4.9. As can be seen for the classification accuracy
and significance testing, the Clean vs. F5, Clean vs. MB 1.2, Clean vs. STN, and Clean vs.
SH embedding methods show a significant difference in the mean, while the difference in
the mean for the other embedding methods is not statistically significant.


Table 4.9. t-test; paired two samples for means between Tables 4.7 and 4.8.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     4.06       1.00       1.51       1.69       6.37       2.23       5.94       3.29       0.0
Statistically Significant  Yes        No         No         No         Yes        No         Yes        Yes        No

4.2.4  Combined  Features 


The  wavelet  features  (Lyu  and  Farid,  2004),  DCT  features  (Pevny  and  Fridrich,  2006) 
and  DCT  directional  and  frequency  decomposition  features  are  combined  to  increase  the 
classification  accuracy  for  each  of  the  targeted  embedding  methods.  The  results  for  the 
combined  features  are  shown  with  feature  selection  in  Table  4.10.  The  total  number  of 
features  in  the  combination  of  the  three  methods  is  526. 


Table 4.10. Classification accuracy for combined feature generation using feature
ranking.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
No. of Features            11         18         5          6          10         5          15         7          5
True Negative              100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0
True Positive              100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0
Classification Accuracy    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0

The results shown in Table 4.10 indicate that perfect detection accuracies are obtained for
each image class by combining the three feature generation methods and performing a
proper ranking of the 526 features. Statistical significance comparisons are performed for
the combined features versus the first three compared methods in this chapter. The
statistical significance shown in Table 4.11 is the classification accuracy comparison
between the combined features from Table 4.10 and the wavelet feature generation
results of Table 4.2.


Table 4.11. t-test: paired two samples for means of wavelet features with feature ranking
vs. combined features with feature ranking.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     19.4       2.73       2.44       16.5       11.4       8.55       20.0       30.8       10.6
Statistically Significant  Yes        No         No         Yes        Yes        Yes        Yes        Yes        Yes

The  statistical  significance  shown  in  Table  4.12  is  the  classification  accuracy  comparison 
between  the  combined  features  from  Table  4.10  and  the  DCT  feature  generation  results  of 
Table  4.5. 


Table 4.12. t-test: paired two samples for means of DCT features with feature ranking vs.
combined features with feature ranking.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     0.0        0.0        0.0        0.0        0.0        0.0        8.70       0.0        0.0
Statistically Significant  No         No         No         No         No         No         Yes        No         No

The  statistical  significance  shown  in  Table  4.13  is  the  classification  accuracy  comparison 
between  the  combined  features  from  Table  4.10  and  the  DCT  directional  and  frequency 
feature  generation  results  of  Table  4.8. 

Table 4.13. t-test: paired two samples for means of DCT directional and frequency
features with feature ranking vs. combined features with feature ranking.

t-Critical Two-Tail: $t_{4,0.975} = 2.776$, $n_1 = n_2 = 5$ (corresponding to 5-fold), $\alpha = 0.05$

Image Classes (Clean vs.)  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
t-Stat                     1.0        0.0        0.0        2.17       2.17       4.00       2.17       0.0        0.0
Statistically Significant  No         No         No         No         No         Yes        No         No         No

The results shown in Table 4.11 indicate a significant improvement in classification
accuracy when comparing the wavelet features with feature ranking versus the combined
features with feature ranking for all embedding methods except JPHide and JSteg.
Table 4.12 shows a classification accuracy improvement only for STN when comparing
DCT features with feature ranking (using 23 features) versus the combined features with
feature ranking (using 15 features). Similarly, in Table 4.13 a classification accuracy
improvement is achieved only for the detection of OG when comparing DCT
decomposition features with feature ranking (using 24 features) versus combined features
with feature ranking (using 5 features). This analysis further highlights the strengths and
weaknesses of each of the feature generation methods and their capability of detecting
certain embedding methods. By combining the features from the three feature generation
methods and applying the SVM-kernel feature ranking method, the classification accuracy
is improved in identifying stego images from clean images.


4.2.5  Summary  of  Feature  Generation  Methods 


Subsections 4.2.1 through 4.2.4 present the results from each individual feature
generation method and from the combined features. A summary of the classification
accuracies is shown in Table 4.14. It is apparent that the combined features integrate the
capability of the three methods and achieve perfect classification accuracy.




Table 4.14. Classification accuracy summary for the individual feature generation and
combined features when feature ranking is used.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
Wavelets                   76.4±2.7   95.3±3.8   98.5±1.3   69.0±4.1   71.3±5.5   74.0±6.7   78.2±2.4   63.0±2.6   84.6±3.2
DCT                        100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    88.7±2.8   100±0.0    100±0.0
DCT Decomp                 99.1±2.0   100±0.0    100±0.0    98.1±1.9   98.1±1.9   98.1±1.1   98.5±1.3   98.1±1.9   100±0.0
Combined                   100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0    100±0.0

The SVM-kernel feature ranking method has shown that the best subset of features can be
identified to improve classification accuracy of the two-class classifier. Table 4.15 shows
how many features each of the feature generation methods contributes to the combined
features of Table 4.10 for clean versus each stego image class. For a list of specific
features associated with the methods in Table 4.15 the reader is referred to Appendix A.


Table 4.15. Number of features used from each of the feature generation methods in
feature combination.

Clean vs.                  F5         JPH        JS         MB         MB 1.2     OG         STN        SH         UTSA
No. of Features            11         18         5          6          10         5          15         7          5
Wavelets                   1          3          0          0          0          0          2          0          0
DCT                        5          12         5          5          5          5          7          5          5
DCT Decomp                 5          3          0          1          5          0          6          2          0

4.3  Results for Individual Multi-class Detection Systems


This section provides results for the seven multi-class classification systems designed to
solve the steganalysis problem of identifying the embedding methods. For each
multi-class detection system the process of performing feature preprocessing, feature
extraction, feature ranking, classification and multi-class classification is followed. In
this section the six classification methods described in Section 2.6 and a commercial tool
are used as part of seven individual multi-class detection systems: expectation
maximization, k-nearest neighbors, Parzen window, probabilistic neural networks, kernel
Fisher’s discriminant, support vector machines, and StegoWatch, which is a commercial
detection tool. The features used for classification are the combination of wavelet
features, DCT features and the presented DCT directional and frequency decomposition
features. The feature improvement includes data standardization, feature extraction and
feature ranking methods which are used in conjunction with the multi-class systems. All
normalization, feature ranking/selection, and settings were tested, and only the best
performing combination is presented. For example, in the EM method in Section 4.3.1, the
Bhattacharyya distance is used instead of the other four feature ranking/selection
methods discussed in Section 2.5 since the Bhattacharyya distance provided the highest
classification accuracy combined with the other parameter combinations.

4.3.1  Expectation  Maximization 

Table 4.16 shows the classification accuracy from a 5-fold cross-validation when
performing multi-class classification using expectation maximization (EM). The feature
improvement methods and classification parameters used in expectation maximization
are listed in the following, in which the combination of parameters provides the highest
classification accuracy.

•  The data for this model is not normalized;

•  Bhattacharyya distance is used for feature ranking with the top 34 out of 526
features selected;

•  PCA is performed on the subset of 34 un-normalized features, resulting in 12
principal components with eigenvalues greater than 1;

•  The number of clusters is determined by using a clustering algorithm on each of
the training classes (Sanguinetti et al., 2005) prior to training the EM algorithm.
Two clusters are used for each class with the exception of the Steganos class,
which requires three clusters. Each class is trained individually, and the 10
individual models return the means and covariances used with the Bayes
classifier.
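
The feature ranking step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the ranking procedure of Section 2.5: it treats each feature as a univariate Gaussian within each class, scores it with the closed-form Bhattacharyya distance between two such Gaussians, and handles only two classes. The function names and the toy data are hypothetical.

```python
import math
import statistics


def bhattacharyya_1d(x, y):
    """Bhattacharyya distance between two univariate Gaussians fitted to x and y.

    Assumes both samples have non-zero variance.
    """
    m1, m2 = statistics.mean(x), statistics.mean(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def rank_features(class_a, class_b, top_n):
    """Return the indices of the top_n features, most separable first."""
    scores = []
    for j in range(len(class_a[0])):
        col_a = [row[j] for row in class_a]
        col_b = [row[j] for row in class_b]
        scores.append((bhattacharyya_1d(col_a, col_b), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:top_n]]


# Hypothetical data: three samples per class, three features per sample.
class_a = [[0.0, 1.0, 5.0], [1.0, 2.0, 6.0], [2.0, 3.0, 7.0]]
class_b = [[10.0, 1.0, 6.0], [11.0, 2.0, 7.0], [12.0, 3.0, 8.0]]

top_two = rank_features(class_a, class_b, top_n=2)
print("Selected feature indices:", top_two)
```

In this toy data the first feature separates the classes widely, the second not at all, and the third only slightly, so the ranking keeps the first and third.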


Table 4.16. Classification accuracy for 10-class expectation maximization classifier.

           Actual
Predicted  Clean    F5       JPH      JS       MB       MB 12    OG       STN      SH       UTSA
Clean      83±5.7   0±0.0    0±0.0    0±0.0    1±2.2    0±0.0    0±0.0    14±4.4   0±0.0    0±0.0
F5         0±0.0    88±9.0   2±2.7    0±0.0    1±2.2    0±0.0    0±0.0    2±4.4    0±0.0    4±4.1
JPH        0±0.0    2±2.7    90±7.0   0±0.0    0±0.0    0±0.0    0±0.0    5±5.0    0±0.0    0±0.0
JS         0±0.0    0±0.0    0±0.0    100±0.0  0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0
MB         0±0.0    0±0.0    0±0.0    0±0.0    51±11.9  49±11.9  0±0.0    0±0.0    5±7.0    0±0.0
MB 12      0±0.0    0±0.0    0±0.0    0±0.0    38±7.5   42±9.0   0±0.0    0±0.0    6±10.8   0±0.0
OG         1±2.2    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    99±2.2   0±0.0    0±0.0    0±0.0
STN        16±6.5   0±0.0    6±8.2    0±0.0    0±0.0    0±0.0    0±0.0    79±18.5  1±2.2    0±0.0
SH         0±0.0    4±4.1    2±4.4    0±0.0    9±6.5    9±6.5    1±2.2    0±0.0    86±6.5   0±0.0
UTSA       0±0.0    6±6.5    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    2±2.7    96±4.1

In Table 4.16, the results show that the MB and MB 12 image classes cannot be separated
by the EM multi-class system since their classification accuracies have mean values of
51% and 42%, respectively. The results show that an MB stego image for testing has a
38% and 9% probability of being misclassified as MB 12 and SH, respectively. On the
other hand, an MB 12 stego image for testing has a 49% and 9% probability of being
misclassified as MB and SH, respectively. EM performs best in identifying JPH, JS, OG
and UTSA image classes with classification accuracies of 90% or better. EM performs
fairly well in identifying Clean, F5, STN and SH image classes with classification
accuracies between 75% and 89%.
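
The way the percentages in Table 4.16 are read can be illustrated with a small sketch that builds the same kind of matrix: one column per actual class, one row per assigned class, each column normalized to 100%. The labels below are hypothetical and only three classes are used; the function name is an illustrative choice.

```python
from collections import Counter


def confusion_percentages(actual, predicted, classes):
    """Return matrix[predicted_class][actual_class] as a percentage of each actual class."""
    counts = Counter(zip(predicted, actual))
    totals = Counter(actual)
    matrix = {}
    for p in classes:
        matrix[p] = {}
        for a in classes:
            matrix[p][a] = 100.0 * counts[(p, a)] / totals[a] if totals[a] else 0.0
    return matrix


# Hypothetical test fold: two Clean, four MB and four MB 12 images.
classes = ["Clean", "MB", "MB 12"]
actual = ["Clean"] * 2 + ["MB"] * 4 + ["MB 12"] * 4
predicted = ["Clean", "Clean", "MB", "MB", "MB 12", "MB 12", "MB", "MB", "MB", "MB 12"]

matrix = confusion_percentages(actual, predicted, classes)
print("MB images assigned to MB 12:", matrix["MB 12"]["MB"], "%")
print("MB 12 images assigned to MB:", matrix["MB"]["MB 12"], "%")
```

The diagonal entry of a column is that class's classification accuracy, and the off-diagonal entries in the same column are its misclassification rates.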


4.3.2  k-Nearest Neighbors (k-NN)

Table 4.17 shows the classification accuracy from 5-fold cross-validation when
performing multi-class classification using k-nearest neighbors. The feature improvement
methods and classification parameters used in k-nearest neighbors are listed in the
following, in which the combination of parameters provides the highest classification
accuracy.

•  The data is normalized using min-max normalization;

•  Fisher’s linear discriminant is used for ranking the features with the top 34 out of
526 features selected;

•  The number of nearest neighbors is determined experimentally based on
classification accuracy, with k = 5.
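
The normalization and classification steps listed above can be sketched as follows, assuming Euclidean distance and simple majority voting (the text does not state the distance metric or the tie-breaking rule). The two-feature training data, the class labels and the function names are hypothetical.

```python
import math
from collections import Counter


def min_max_fit(rows):
    """Per-feature minimum and maximum, learned from the training data only."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def min_max_apply(row, lo, hi):
    """Scale each feature to [0, 1] using the training minimum and maximum."""
    return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training samples closest to the query."""
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set with two features on very different scales.
train_raw = [[1.0, 10.0], [2.0, 20.0], [1.5, 15.0], [8.0, 80.0], [9.0, 90.0], [8.5, 85.0]]
train_y = ["Clean", "Clean", "Clean", "F5", "F5", "F5"]

lo, hi = min_max_fit(train_raw)
train_x = [min_max_apply(row, lo, hi) for row in train_raw]

label_a = knn_predict(train_x, train_y, min_max_apply([8.2, 79.0], lo, hi), k=5)
label_b = knn_predict(train_x, train_y, min_max_apply([1.2, 12.0], lo, hi), k=5)
print(label_a, label_b)
```

Fitting the minimum and maximum on the training folds and reusing them for the test fold keeps information from the test data out of the model.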

Table 4.17. Classification accuracy for 10-class k-NN classifier.

           Actual
Predicted  Clean    F5       JPH      JS       MB       MB 12    OG       STN      SH       UTSA
Clean      78±7.5   0±0.0    2±2.7    0±0.0    1±2.2    0±0.0    1±2.2    30±15.4  1±2.2    0±0.0
F5         1±2.2    95±6.1   1±2.2    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    1±2.2    6±6.5
JPH        0±0.0    1±2.2    92±2.7   0±0.0    0±0.0    0±0.0    0±0.0    5±8.6    0±0.0    0±0.0
JS         0±0.0    0±0.0    0±0.0    100±0.0  0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0
MB         0±0.0    0±0.0    0±0.0    0±0.0    54±6.5   52±9.0   0±0.0    0±0.0    7±10.9   0±0.0
MB 12      0±0.0    0±0.0    0±0.0    0±0.0    38±4.4   47±9.7   0±0.0    0±0.0    3±4.4    0±0.0
OG         0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    99±2.2   0±0.0    1±2.2    0±0.0
STN        21±8.2   1±2.2    5±3.5    0±0.0    0±0.0    0±0.0    0±0.0    65±11.1  1±2.2    0±0.0
SH         0±0.0    0±0.0    0±0.0    0±0.0    7±5.7    1±2.2    0±0.0    0±0.0    86±10.8  0±0.0
UTSA       0±0.0    3±2.7    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    0±0.0    94±6.5


In Table 4.17, the results show that the MB and MB 12 image classes cannot be separated
by the k-NN multi-class system since their classification accuracies have mean values of
54% and 47%, respectively. This indicates that an MB stego image for testing has a 38%
and 7% probability of being misclassified as MB 12 and SH, respectively. On the other
hand, an MB 12 stego image for testing has a 52% probability of being misclassified as
MB. In addition, k-NN barely does better than a coin toss in classifying STN, with a
classification accuracy of 65% and a 30% probability of misclassifying STN as Clean.
k-NN performs best in identifying F5, JPH, JS, OG and UTSA image classes with
classification accuracies > 90%. k-NN performs fairly well in identifying Clean and SH
image classes with classification accuracies between 75% and 89%. When comparing
Table 4.17 with Table 4.16, both methods appear to misclassify MB and MB 12. This is in
large part due to the features being used: although two different feature ranking methods
are used (expectation maximization uses Bhattacharyya feature ranking with 34 features
and k-NN uses Fisher’s linear discriminant with 34 features), 30 of the 34 features are the
same in both.

4.3.3  Probabilistic Neural Networks (PNN)


Table 4.18 shows the classification accuracy from 5-fold cross-validation when
performing multi-class classification using PNN. The feature improvement methods and
classification parameters used in probabilistic neural networks are listed in the following,
in which the combination of parameters provides the highest classification accuracy.

•  The data is normalized using Z-score normalization;

•  The feature ranking is conducted using signal-to-noise ratio with the top 58 out of
526 features selected;

•  Spread parameter σ = 0.24.
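
A minimal sketch of the normalization and classification steps above is given below. It assumes the common Gaussian-kernel form of a probabilistic neural network with equal class priors, which may differ in detail from the formulation in Section 2.6; the training data, class labels and function names are hypothetical, and only the spread value 0.24 is taken from the list.

```python
import math
import statistics


def z_score_fit(rows):
    """Per-feature mean and sample standard deviation from the training data."""
    cols = list(zip(*rows))
    return [statistics.mean(c) for c in cols], [statistics.stdev(c) for c in cols]


def z_score_apply(row, mean, std):
    """Center and scale each feature; constant features map to 0."""
    return [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mean, std)]


def pnn_predict(train_x, train_y, query, sigma=0.24):
    """Pick the class with the largest average Gaussian kernel response."""
    sums, counts = {}, {}
    for x, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        sums[label] = sums.get(label, 0.0) + math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])


# Hypothetical training set with two features.
train_raw = [[1.0, 10.0], [2.0, 20.0], [1.5, 15.0], [8.0, 80.0], [9.0, 90.0], [8.5, 85.0]]
train_y = ["Clean", "Clean", "Clean", "F5", "F5", "F5"]

mean, std = z_score_fit(train_raw)
train_x = [z_score_apply(row, mean, std) for row in train_raw]

label_a = pnn_predict(train_x, train_y, z_score_apply([8.4, 84.0], mean, std))
label_b = pnn_predict(train_x, train_y, z_score_apply([1.4, 14.0], mean, std))
print(label_a, label_b)
```

A small spread such as 0.24 makes each training sample influence only its close neighborhood, so the decision is dominated by the nearest samples of each class.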


Table  4.18.  Classification  accuracy  for  10-class  PNN  classifier. 


In Table 4.18, the results show that the MB, MB 12 and STN image classes cannot be
separated by the PNN multi-class system since their classification accuracies have mean
values of 57%, 58% and 45%, respectively. In contrast to EM and k-NN in Tables 4.16 and
4.17, PNN classifies five stego methods, F5, JPH, JS, OG and UTSA, with a 98%
classification accuracy or better; however, it fails to separate STN from Clean and MB
from MB 12.


4.3.4  Parzen  window 


Table 4.19 shows the classification accuracy from 5-fold cross validation when
performing multi-class classification using Parzen window. The feature improvement
methods and classification parameters used in Parzen window are listed below; this
combination of parameters provides the highest classification accuracy.

• The data is normalized using Z-score normalization;

• Fisher’s linear discriminant is used for ranking the features with the top 36 out of
526 features selected;

• Window width σ = 0.85.
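
The normalization and window-based classification steps listed above can be illustrated with a minimal sketch. It assumes a Gaussian window, toy two-feature data and the sample standard deviation for the Z-score; the function names and the toy samples are illustrative assumptions, not the implementation used in this research, and the feature ranking step is omitted.

```python
import math


def zscore_fit(rows):
    """Return per-feature (means, stds) estimated from the training rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = []
    for j in range(dims):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant feature: avoid dividing by zero
    return means, stds


def zscore_apply(row, means, stds):
    """Z-score normalize one feature vector with the training statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def parzen_classify(x, train_x, train_y, sigma):
    """Return the class with the largest Gaussian-window density estimate at x."""
    total = {}
    count = {}
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        total[yi] = total.get(yi, 0.0) + math.exp(-d2 / (2.0 * sigma ** 2))
        count[yi] = count.get(yi, 0) + 1
    # Average window response per class (density up to a shared constant).
    return max(total, key=lambda c: total[c] / count[c])


# Toy data (hypothetical): two features, two image classes.
train = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels = ["Clean", "Clean", "F5", "F5"]
means, stds = zscore_fit(train)
train_z = [zscore_apply(r, means, stds) for r in train]

query = zscore_apply([0.1, 0.1], means, stds)
print(parzen_classify(query, train_z, labels, sigma=0.85))
```

The window width plays the same role as the spread parameter of the PNN in Section 4.3.3: a smaller value makes the estimate depend more strongly on the nearest training samples.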


Table 4.19. Classification accuracy for 10-class Parzen window classifier.

                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      82±9.0    0±0.0     4±4.1     0±0.0     0±0.0     0±0.0     1±2.2     30±28.9   0±0.0     0±0.0
F5         0±0.0     99±2.2    1±2.2     0±0.0     1±2.2     0±0.0     0±0.0     0±0.0     0±0.0     4±4.1
JPH        0±0.0     0±0.0     90±6.1    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JS         0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
MB         0±0.0     0±0.0     0±0.0     0±0.0     57±14.5   53±15.2   0±0.0     0±0.0     1±2.2     0±0.0
MB12       0±0.0     0±0.0     0±0.0     0±0.0     33±9.7    42±10.3   0±0.0     0±0.0     1±2.2     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     99±2.2    0±0.0     0±0.0     0±0.0
STN        18±9.0    0±0.0     5±6.1     0±0.0     1±2.2     0±0.0     0±0.0     70±28.9   2±2.7     0±0.0
SH         0±0.0     0±0.0     0±0.0     0±0.0     8±10.3    5±7.0     0±0.0     0±0.0     96±4.1    0±0.0
UTSA       0±0.0     1±2.2     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     96±4.1

In Table 4.19, the results show that the Parzen window method is able to classify F5, JPH,
JS, OG, SH and UTSA with a classification accuracy of 90% or better. Although it fails to
separate STN from Clean, the STN classification accuracy improves to 70% with Parzen
window, compared with k-NN and PNN. As compared to Table 4.16, the Parzen window method
performs better on SH, with a 96% classification accuracy versus 86% for EM.

4.3.5  Kernel  Fisher’s  Discriminant  (KFD)  with  Multi-class  Tree 

Table 4.20 shows the classification accuracy from 5-fold cross validation when
performing multi-class classification using KFD. The feature improvement methods and
classification parameters used in kernel Fisher’s discriminant are listed below; this
combination of parameters provides the highest classification accuracy.

• The data is normalized using Z-score normalization;

• The feature ranking at each of the nodes is conducted using kernel feature ranking;

• The nodes correspond to Figure 4.2, where the top 50 features are used for
classification in node A (i.e., classes 1 through 10), the top 46 features for node B,
the top 34 features for node D, the top 24 features for node G, the top 36 features for
node E, the top 32 features for node H, the top 26 features for node I, the top 31
features for node C and the top 25 features for node F;

• The kernel used is the radial basis function with the normalizing constant C = 12
and σ = 3.

[Figure 4.2: decision tree diagram. Node labels recoverable from the figure: A = [1 2 3 4
5 6 7 8 9 10]; one node = [7 9 10]; F = [9 10]; I = [5 6]; leaf nodes Clean = [1],
F5 = [2], JPH = [3], JS = [4], MB = [5], MB12 = [6] and UTSA = [10].]

Figure 4.2. Decision tree for a 10-class classification problem with 10 leaf nodes, 9
parent nodes and a maximum depth of 6.

Table 4.20. Classification accuracy for 10-class KFD classifier.

                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      78±5.7    0±0.0     1±2.2     0±0.0     0±0.0     0±0.0     0±0.0     20±12.7   0±0.0     0±0.0
F5         0±0.0     94±6.5    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     2±2.7
JPH        2±2.7     0±0.0     92±4.4    0±0.0     0±0.0     0±0.0     0±0.0     2±2.7     0±0.0     0±0.0
JS         0±0.0     0±0.0     0±0.0     94±6.5    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
MB         0±0.0     1±2.2     0±0.0     0±0.0     54±2.2    40±10.6   2±2.7     0±0.0     8±7.5     1±2.2
MB12       0±0.0     0±0.0     0±0.0     0±0.0     40±3.5    59±10.8   0±0.0     0±0.0     1±2.2     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     98±2.7    0±0.0     0±0.0     0±0.0
STN        19±5.4    1±2.2     0±0.0     5±7.0     1±2.2     0±0.0     0±0.0     78±14.4   0±0.0     0±0.0
SH         1±2.2     0±0.0     7±4.4     1±2.2     5±5.0     1±2.2     0±0.0     0±0.0     90±7.9    0±0.0
UTSA       0±0.0     4±4.1     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     97±2.7

In Table 4.20, the results show that KFD is able to classify F5, JPH, JS, OG, SH
and UTSA with a classification accuracy of 90% or better. Compared with Tables 4.16
through 4.19, KFD does not reach perfect classification accuracy on certain methods;
however, it performs better on average over all the image classes.


4.3.6  Support  Vector  Machines  (SVM)  with  Multi-class  Tree 

Table  4.21  shows  the  classification  accuracy  from  5-fold  cross  validation  when 
performing  multi-class  classification  using  SVM. 

The feature improvement methods and classification parameters used in support vector
machines with multi-class tree are listed below; this combination of parameters provides
the highest classification accuracy.

• The data is normalized using Z-score normalization;

• The feature ranking at each of the nodes is conducted using kernel feature ranking;

• The nodes correspond to Figure 4.3, with the top 90 features selected for node A,
the top 44 features for node B, the top 46 features for node D, the top 21 features for
node G, the top 63 features for node E, the top 48 features for node H, the top 19
features for node I, the top 46 features for node C and the top 22 features for node F;

• The kernel used is the radial basis function with the normalizing constant C = 6
and σ = 3.

Figure 4.3. Decision tree for a 10-class classification problem with 10 leaf nodes, 9
parent nodes and a maximum depth of 6.

Table 4.21. Classification accuracy for 10-class SVM classifier.

                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      86±4.1    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     17±16.0   1±2.2     0±0.0
F5         1±2.2     95±3.5    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     1±2.2
JPH        0±0.0     1±2.2     92±4.4    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JS         1±2.2     4±2.2     6±6.5     100±0.0   0±0.0     0±0.0     2±2.7     0±0.0     2±4.4     1±2.2
MB         0±0.0     0±0.0     0±0.0     0±0.0     53±7.5    43±12.0   0±0.0     0±0.0     1±2.2     0±0.0
MB12       0±0.0     0±0.0     0±0.0     0±0.0     46±8.2    56±13.8   0±0.0     0±0.0     0±0.0     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     97±2.7    0±0.0     0±0.0     0±0.0
STN        11±2.2    0±0.0     2±2.7     0±0.0     0±0.0     0±0.0     0±0.0     82±16.0   0±0.0     0±0.0
SH         0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     1±2.2     0±0.0     1±2.2     96±4.1    0±0.0
UTSA       0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     98±2.7

In Table 4.21, the results show that the MB and MB12 image classes cannot be separated
by the SVM multi-class system, since their mean classification accuracies are
53% and 56%, respectively. However, Table 4.21 shows that SVM with multi-class tree
generally performs better on the other image classes than the classifiers in Tables 4.16
through 4.20. For instance, Clean has a classification accuracy of 86% and STN has a
classification accuracy of 82%, both higher than those of the other five multi-class
classifiers.
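
The multi-class tree used with the two-class KFD and SVM classifiers (Figures 4.2 and 4.3) can be pictured as a binary tree in which every parent node holds one two-class decision and every leaf holds a single class label. The sketch below is illustrative only: the four-class toy tree, the threshold rules standing in for trained KFD or SVM classifiers, and the names `TreeNode`, `goes_left` and `classify` are assumptions, not the trees or classifiers trained in this research.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class TreeNode:
    classes: tuple  # class labels still possible at this node
    goes_left: Optional[Callable[[Sequence[float]], bool]] = None  # two-class decision
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def leaf(name: str) -> TreeNode:
    """A leaf node carries exactly one class label and no decision."""
    return TreeNode((name,))


def classify(node: TreeNode, x: Sequence[float]) -> str:
    """Walk from the root to a leaf, applying one two-class decision per parent node."""
    while node.goes_left is not None:
        node = node.left if node.goes_left(x) else node.right
    return node.classes[0]


# Hypothetical toy tree: 4 leaves and 3 parent nodes. The threshold rules are
# placeholders for trained two-class classifiers.
toy_tree = TreeNode(
    ("Clean", "STN", "MB", "MB12"),
    goes_left=lambda x: x[0] < 0.0,
    left=TreeNode(("Clean", "STN"), lambda x: x[1] < 0.0, leaf("Clean"), leaf("STN")),
    right=TreeNode(("MB", "MB12"), lambda x: x[1] < 0.0, leaf("MB"), leaf("MB12")),
)

print(classify(toy_tree, [-1.0, 2.0]))
```

A binary tree with k leaves always has k - 1 parent nodes, which agrees with the 10 leaf nodes and 9 parent nodes stated in the captions of Figures 4.2 and 4.3. Because each parent node is a separate two-class problem, each node can use its own feature subset, as in the per-node feature counts listed in Sections 4.3.5 and 4.3.6.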


4.3.7  StegoWatch 

Table 4.22 shows the classification accuracy from 5-fold cross validation when
performing multi-class classification using StegoWatch. Observe from Table 4.22 that
StegoWatch clearly targets the identification of the F5 embedding method above all
others. For this tool the results are returned as either H, M or L for a high, medium or
low stego detection level. If the image is clean, an OK is returned. For the image data
set analyzed in this research, StegoWatch also returns a comment indicating that F5 has
been identified. Three classes are assigned for this tool: if an H or M is returned the
image is labeled as stego, if an L or OK is returned the image is labeled as clean, and
if the comment indicates that F5 was identified then F5 is the class label.
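
The three-class labeling rule described above can be written as a small mapping. This is a sketch under stated assumptions: the function name, the argument names and the representation of the tool output as a level string plus a free-text comment are hypothetical, and giving the F5 comment precedence over the detection level is an assumption, since the text does not state which takes priority.

```python
def stegowatch_class(level: str, comment: str = "") -> str:
    """Map a StegoWatch result to one of the class labels Clean, F5 or Stego."""
    if "F5" in comment:          # assumed precedence: an F5 comment decides the label
        return "F5"
    if level in ("H", "M"):      # high or medium detection level
        return "Stego"
    if level in ("L", "OK"):     # low detection level or clean
        return "Clean"
    raise ValueError(f"unexpected detection level: {level!r}")


print(stegowatch_class("H"))
print(stegowatch_class("OK"))
print(stegowatch_class("M", comment="F5 identified"))
```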


Table 4.22. Classification accuracy for StegoWatch detection system.

                     Actual
Predicted  Clean     F5        Stego
Clean      51±6.9    0±0.0     48±12.4
F5         0±0.0     100±0.0   0±0.0
Stego      49±6.9    0±0.0     52±12.4

In Table 4.22, the results show the classification accuracies on Clean, F5 and all of the
other (Stego) image classes. Except for F5, the image classes cannot be separated by
this detection system, since their classification accuracies are around 50%.

4.3.8  Summary  of  Steganalysis  Multi-Class  results 

Table 4.23 summarizes the classification accuracies of the seven multi-class classifiers
that were examined in this chapter. Since StegoWatch is clearly specialized in identifying
F5, it is not included in the following analysis, which identifies the multi-class
classifier that best targets each embedding method. However, for completeness the
StegoWatch classification accuracy is still shown in Table 4.23. The statistical
significance of the difference between the best of the true multi-class classifiers (i.e.,
EM, k-NN, Parzen and PNN) and the best of the tree-structure multi-class classifiers (i.e.,
KFD and SVM) is assessed using a t-test and shown in Table 4.24. The best classifiers
according to this grouping are indicated in bold in Table 4.23 and are then used in the
statistical comparison in Table 4.24. Based on overall classification accuracy in Table
4.23, the best individual system appears to be SVM used in a multi-class tree structure.

Table 4.23. Classification accuracy for multi-class detection system.


Classifier  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA      CA
EM          83±5.7    88±9.0    90±7.0    100±0.0   51±11.9   42±9.0    99±2.2    79±18.5   86±6.5    96±4.1    81.4±20.5
k-NN        78±7.5    95±6.1    92±2.7    100±0.0   54±6.5    47±9.7    99±2.2    65±11.1   86±10.8   94±6.5    81±19.6
Parzen      82±9.0    99±2.2    90±6.1    100±0.0   57±14.4   42±10.3   99±2.2    70±28.9   96±4.1    96±4.1    83.1±22.0
PNN         84±5.4    99±2.2    100±0.0   100±0.0   57±9.0    58±6.7    98±2.7    45±15.8   91±7.4    100±0.0   83.2±21.5
KFD         78±5.7    94±6.5    92±4.4    94±6.5    54±2.2    59±10.8   98±2.7    78±14.4   90±7.9    97±2.7    83.4±16.5
SVM         86±4.1    95±3.5    92±4.4    100±0.0   53±7.5    56±13.8   97±2.7    82±16.0   96±4.1    98±2.7    85.5±17.9
StegoWatch  51±6.9    100±0.0   -         -         -         -         -         -         -         -         67.7±10.6

StegoWatch assigns only the Clean, F5 and Stego class labels (Section 4.3.7); its
accuracy on the combined Stego class is 52±12.4.

Table 4.24. t-test: paired two samples for means.

t-Critical Two-Tail: t(4, 0.975) = 2.776, n1 = n2 = 5 (corresponding to 5-fold), α = 0.05

Image Class  Classifier Comparison  t-Stat  Statistically Significant
Clean        PNN vs. SVM            0.49    No
F5           PNN vs. SVM            2.13    No
JPH          PNN vs. SVM            4.0     Yes
JS           PNN vs. SVM            0.0     No
MB           PNN vs. KFD            0.6     No
MB12         PNN vs. KFD            0.27    No
OG           EM vs. KFD             1.0     No
STN          EM vs. SVM             0.8     No
SH           Parzen vs. SVM         0.0     No
UTSA         PNN vs. SVM            1.63    No

As can be seen in Table 4.24, only the JPH image class shows a significant difference in
the mean between the best result from a multi-class classifier and the best result from
the multi-class tree. The differences in the mean for the rest of the image classes are
not statistically significant. In addition, the results from the individual tables show
that the various classifiers each have individual strengths when identifying the various
embedding methods. To take advantage of the individual classifiers, the next section uses
fusion to combine the seven detection systems.

4.4 Fusion

As shown in Section 4.3, no single multi-class classifier has a clear advantage; instead,
each of the multi-class classifiers has individual strengths. To make use of these
strengths, the three fusion techniques presented in Section 3.4 are applied and the
results are shown in this section. For AdaBoost and Bayesian network fusion the class
labels are fused as discrete values. For the commercial tool, StegoWatch, the results are
returned as either L or OK, indicating a clean class label, or F5. PNN fusion, however,
requires that the results fed into the fusion system be posterior probabilities. To solve
this problem for the commercial tool, two inputs are used, clean and F5. If the result
returned is clean (L or OK), the clean input is assigned a posterior probability of 0.9
and the F5 input is assigned 0.01. If the result returned is F5, the F5 input is assigned
a posterior probability of 0.9 and the clean input is assigned 0.01. For the 10-class
classifiers, probabilities are assigned to each of the 10 classes, but only 9 of the 10
labels from each of the classifiers are used to train the fusion system, allowing proper
training of the weights.
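
The conversion of the StegoWatch result into the two PNN fusion inputs can be sketched as follows. The values 0.9 and 0.01 are taken from the text; the function name, the dictionary keys and the handling of any other result (rejected here with an error) are assumptions made for illustration.

```python
def stegowatch_fusion_inputs(result: str) -> dict:
    """Return the (Clean, F5) posterior-probability inputs for PNN fusion."""
    if result in ("L", "OK"):    # tool reports a clean image
        return {"Clean": 0.9, "F5": 0.01}
    if result == "F5":           # tool reports F5
        return {"Clean": 0.01, "F5": 0.9}
    # The text describes only the L, OK and F5 cases for fusion.
    raise ValueError(f"no fusion input defined for result {result!r}")


print(stegowatch_fusion_inputs("OK"))
print(stegowatch_fusion_inputs("F5"))
```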

4.4.1  AdaBoost 

The results for AdaBoost fusion are shown in Table 4.25. Fusing the seven multi-class
systems improves the separation between the Clean and Steganos (STN) classes as well as
between the Model-based and Model-based Version 1.2 classes.

Table 4.25. Classification accuracy for AdaBoost fusion.


                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      86±4.1    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     16±19.8   0±0.0     0±0.0
F5         0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JPH        1±2.2     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JS         0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
MB         0±0.0     0±0.0     0±0.0     0±0.0     63±7.5    37±10.3   0±0.0     0±0.0     2±4.4     0±0.0
MB12       0±0.0     0±0.0     0±0.0     0±0.0     34±6.5    61±8.2    0±0.0     0±0.0     0±0.0     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0
STN        13±4.4    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     84±19.8   1±2.2     0±0.0
SH         0±0.0     0±0.0     0±0.0     0±0.0     3±2.7     2±2.7     0±0.0     0±0.0     97±4.1    0±0.0
UTSA       0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0

With AdaBoost, the classification accuracies of MB and MB12 are 63% and 61%.
Compared with the best individual classifier shown in Table 4.23, AdaBoost
improves the classification accuracy for these two image classes.


4.4.2  Bayes  Fusion 


The results for Bayes fusion are shown in Table 4.26. Similar to AdaBoost, this fusion
method also improves the separation between the Clean and Steganos (STN) classes as well
as between the Model-based and Model-based Version 1.2 classes.

Table 4.26. Classification accuracy for Bayes fusion.


                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      89±2.2    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     13±16.4   1±2.2     0±0.0
F5         0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     0±0.0
JPH        1±2.2     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JS         0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     0±0.0
MB         0±0.0     0±0.0     0±0.0     0±0.0     63±8.3    37±8.3    0±0.0     0±0.0     1±2.2     0±0.0
MB12       0±0.0     0±0.0     0±0.0     0±0.0     34±10.8   63±8.3    0±0.0     0±0.0     0±0.0     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0
STN        10±3.5    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     86±15.5   0±0.0     0±0.0
SH         0±0.0     0±0.0     0±0.0     0±0.0     3±2.7     0±0.0     0±0.0     1±2.2     96±4.1    0±0.0
UTSA       0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0

As compared to the best individual classifier shown in Table 4.23, this fusion technique
maintains or improves the classification accuracy for every image class.

4.4.3  PNN  Fusion 

The results for PNN fusion are shown in Table 4.27. Similar to the previous two fusion
systems, the separation between the Clean and STN classes as well as between the
MB and MB12 classes is also improved.

Table 4.27. Classification accuracy for PNN fusion.


                                                   Actual
Predicted  Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA
Clean      88±2.7    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     15±17.6   1±2.2     0±0.0
F5         0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0
JPH        1±2.2     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     1±2.2     0±0.0     0±0.0
JS         0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0     0±0.0     3±4.4     0±0.0
MB         0±0.0     0±0.0     0±0.0     0±0.0     56±13.8   34±10.8   0±0.0     0±0.0     0±0.0     0±0.0
MB12       0±0.0     0±0.0     0±0.0     0±0.0     40±14.5   63±8.3    0±0.0     0±0.0     0±0.0     0±0.0
OG         0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0   0±0.0     0±0.0     0±0.0
STN        11±2.2    0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     84±19.8   0±0.0     0±0.0
SH         0±0.0     0±0.0     0±0.0     0±0.0     4±2.2     3±2.7     0±0.0     0±0.0     96±4.1    0±0.0
UTSA       0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     0±0.0     100±0.0

With  PNN  fusion,  the  classification  accuracies  of  Clean,  F5,  MB,  MB  12  and  STN  image 
classes  are  improved  as  compared  to  the  best  individual  classifier  shown  in  Table  4.23. 


4.4.4  Summary  of  Multi-class  Steganalysis  Fusion  Techniques 

Table 4.28 shows the classification accuracy of the fusion methods and the best
individual classifier, i.e., SVM with multi-class tree. It is chosen as the best individual
classifier purely based on its overall classification accuracy of 85.5% (Table 4.23). Table
4.28 shows that each of the fusion methods has an equal or higher classification accuracy
than the best individual classifier.

Table 4.28. Classification accuracy comparisons between the best individual results and
the three fusion methods.


Classifier       Clean     F5        JPH       JS        MB        MB12      OG        STN       SH        UTSA      CA
SVM              86±4.1    95±3.5    92±4.4    100±0.0   53±7.5    56±13.8   97±2.7    82±16.0   96±4.1    98±2.7    85.5±17.3
AdaBoost Fusion  86±4.1    100±0.0   100±0.0   100±0.0   63±7.5    61±8.2    100±0.0   84±19.8   96±4.1    100±0.0   89±15.3
Bayes Fusion     89±2.2    100±0.0   100±0.0   100±0.0   63±7.5    63±8.3    100±0.0   86±15.5   96±4.1    100±0.0   89.7±15.3
PNN Fusion       89±2.2    100±0.0   100±0.0   100±0.0   65±3.5    63±8.3    100±0.0   86±15.5   96±4.1    100±0.0   89.9±14.9
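
As a quick consistency check, the CA column of Table 4.28 agrees with the unweighted mean of the ten per-class accuracies in each row. The sketch below reproduces that arithmetic; reading CA as this mean is an inference from the tabulated numbers rather than a definition given in the text, and the variable names are illustrative.

```python
from statistics import mean

# Per-class mean accuracies (percent) from Table 4.28, in the column order
# Clean, F5, JPH, JS, MB, MB12, OG, STN, SH, UTSA.
per_class = {
    "SVM": [86, 95, 92, 100, 53, 56, 97, 82, 96, 98],
    "AdaBoost Fusion": [86, 100, 100, 100, 63, 61, 100, 84, 96, 100],
    "Bayes Fusion": [89, 100, 100, 100, 63, 63, 100, 86, 96, 100],
    "PNN Fusion": [89, 100, 100, 100, 65, 63, 100, 86, 96, 100],
}

# Overall classification accuracy taken as the unweighted mean over classes.
overall = {name: round(mean(values), 1) for name, values in per_class.items()}

for name, ca in overall.items():
    print(f"{name}: {ca}")
```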

The three fusion techniques are equally valid choices for combining the individual
multi-class classifiers from Section 4.3. In Table 4.29 the t-test is performed between
PNN fusion (highest overall CA of the fusion methods examined) and the SVM multi-class
classifier (highest overall CA of the individual multi-class classifiers) to determine
whether the difference in the means between these two methods is statistically significant.
As noted in Table 4.29, the two methods show statistically significant differences for the
F5, JPH, MB and STN image classes.


Table 4.29. t-test: paired two samples for means of classification accuracy between PNN
fusion and SVM.

t-Critical Two-Tail: t(4, 0.975) = 2.776, n1 = n2 = 5 (corresponding to 5-fold), α = 0.05

Image Class  t-Stat  Statistically Significant
Clean        1.5     No
F5           3.16    Yes
JPH          4       Yes
JS           0.0     No
MB           2.9     Yes
MB12         1.20    No
OG           2.44    No
STN          4       Yes
SH           0.0     No
UTSA         1.63    No
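
The significance decisions in Tables 4.24 and 4.29 follow from comparing a paired t statistic over the five folds with the critical value 2.776. The sketch below shows that computation. The per-fold accuracies are hypothetical values, chosen only to be consistent with the means of 100% and 92% reported for the JPH class; the actual fold results are not given in the text, and the function name `paired_t` is illustrative.

```python
import math
from statistics import mean, stdev

T_CRITICAL = 2.776  # two-tailed, 4 degrees of freedom, alpha = 0.05 (Table 4.29)


def paired_t(a, b):
    """Paired two-sample t statistic for the mean of the fold-wise differences."""
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0.0:
        # Identical differences in every fold: no variation to test against.
        return 0.0 if mean(diffs) == 0 else math.inf
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical per-fold accuracies (percent) for one image class.
fusion_folds = [100, 100, 100, 100, 100]
svm_folds = [90, 90, 90, 90, 100]

t_stat = paired_t(fusion_folds, svm_folds)
significant = abs(t_stat) > T_CRITICAL
print(round(t_stat, 2), "Yes" if significant else "No")
```

With these illustrative folds the statistic is 4, above the critical value, which matches the "Yes" decision recorded for JPH; two classifiers with identical fold results give a statistic of 0, as in the JS and SH rows.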

In this subsection it was shown that the fusion techniques have equal or greater
classification accuracy than any of the individual classifiers, and that for certain
image classes the improvement is statistically significant (Table 4.29).

4.5 Summary


In this chapter, the DCT decomposition feature generation, kernel feature ranking,
decision tree for multi-class classification and the fusion techniques have shown
improvements in classification accuracy when determining the stego algorithm used to
create a stego image. Results comparing the feature generation methods described in
Section 3.2 show that the new features are able to distinguish between Steganos and the
other classes, while the wavelet feature generation method (Lyu and Farid, 2002) and
DCT feature generation (Pevny and Fridrich, 2006) have shown difficulty in identifying
Steganos. It has also been shown that by combining all of the feature generation methods,
detection improves. Additionally, by performing feature ranking, detection results for the
SVM classifier are improved. The third area of improvement is the development of a
multi-class tree that is used with two-class KFD and SVM classifiers. The tree in this
case is expanded by using a distance measure in the kernel space. While the classification
tree shows promise, the results can be further improved through the use of a fusion
technique. The fusion techniques use the strengths of each individual multi-class
detection system to better predict the embedding method. The t-test was used in this
chapter to determine whether the improvements in the classification of individual
steganography methods are statistically significant. While no individual system was shown
to be statistically significantly better than any of the others, it is important to note
that the real utility of the methods in this research lies in using each and every
available detection system to improve the identification of steganography methods.

V. Conclusion and Recommendations


This research demonstrated a steganalysis classification system that identifies the
steganography embedding method used in a given JPEG image. The system includes feature
preprocessing, feature extraction, feature ranking, classification and multi-class
classification. The methodology, analyses and experimental results with system
validation have been described and demonstrated in Chapters 3 and 4. The results show
a statistically significant difference for the proposed classification system, which is
essential for such a system. This chapter summarizes the research conducted and also
presents the advantages and disadvantages of this steganalysis classification system.
Further research can be based not only on the constraints and limitations encountered when
developing the system but also on its application to other areas.

5.1  Application  of  Results 

This research proposes a novel multi-class detection system applied to the problem of
steganalysis. The complete system is shown in Figure 5.1. The input consists of the
clean and stego image sets, where the stego images are created using one of the embedding
methods F5, JP Hide, JSteg, Model-based, Model-based Version 1.2, OutGuess, Steganos,
StegHide or UTSA. Features are generated from each image and each feature set is assigned
a class label identifying the embedding method used. Three components, Multi-class
Detection for EM/k-NN/Parzen/PNN, Multi-class Detection for KFD/SVM, and Commercial
Detection Systems are integrated as an 8 multi-class system. The components analyze the
raw features and their results are fused in order to assign a final class label.

Figure 5.1. Detection system.


The multi-class fusion system developed in this dissertation gives the steganalyst the
ability to combine all available tools from both the research community and the commercial
industry in one detection system. For certain law enforcement agencies
that use detection methods not available to outside agencies (i.e., ILook Investigator,
Detica’s Inforenz Forager, SecureStego (AFRL) and WetStone’s Stego Suite with added
applications), the fusion method provides a means to incorporate the class labels of any
tools necessary for identifying stego methods.

As shown in Section 4.2, the classification accuracy for each feature generation method
(wavelet features, DCT features, and DCT decomposition features) is increased with feature
ranking. Using SVM-kernel feature ranking, the wavelet feature generation method has
an average increase of 7% in classification accuracy compared with not using feature
ranking. The DCT features show an increase in classification accuracy on Clean vs. MB and
Clean vs. STN of 0.5% and 1.6%, respectively, when using SVM-kernel feature ranking. The
DCT decomposition features have an average increase of 2.5% in classification accuracy in
comparison to not using feature ranking. Among the three feature generation methods
shown in Table 4.14, while the DCT features are able to classify most of the stego
methods accurately, the proposed DCT decomposition features have an increase in
classification accuracy of 10% on Steganos (STN) over the DCT features. This allows the
combination of features with feature ranking to separate the Clean vs. all the Stego image
classes as shown in Table 4.14 with perfect classification accuracy. By creating a
multi-class classifier using the decision tree in Section 3.3, the proposed SVM with tree
structure has an increase in classification accuracy of 2.3% over PNN, as shown in Table
4.23. Furthermore, with the use of fusion techniques, the overall classification accuracy
increases from 85.5% for the best individual classifier to at least 89% (see Table 4.28).
AdaBoost, Bayes, and PNN fusion obtain classification accuracies of 89%, 89.7% and 89.9%,
respectively.

5.2  Recommendations  for  Future  Work 

Extracting the hidden information is necessary for law enforcement to build a
criminal case that will hold up in court. This problem of extraction leads to an
intermediate step of identifying the embedding method used to create the stego file.
Another problem exists for the steganalyst, in that several tools are available to detect
whether an image is clean or stego. The multi-class classification system developed needs
to be expanded to identify more steganography algorithms. This expansion includes the
ability to identify embedding techniques other than DCT coefficients, such as header
analysis and spatial embedding methods. Additionally, the techniques should be
extended to classify JPEG images with different image sizes, quality factors, and camera
types. Also, the difficulties are currently not well understood when it comes to images
taken from entirely different scenes or computer-generated images.

In header analysis, methods such as F5 (Westfeld, 2001; 2003) and Invisible Secrets (2008)
manipulate the header of the stego image in different ways. By analyzing the header of an
image, the embedding methods that manipulate headers can be identified. Both StegAlyzerSS
and StegoWatch identified the default header for F5; however, neither of these detectors
is capable of identifying Invisible Secrets. The work by Pevny and Fridrich (2006),
supported by the Air Force Research Laboratory, analyzes various image sizes, quality
factors and camera types. Their research in these categories can be used in conjunction
with the presented detection system to improve the identification of the embedding methods
used.
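
As an illustration of what header analysis can look like in practice, the sketch below walks the marker segments of a JPEG stream, collects the comment (COM, 0xFFFE) segments that precede the scan data, and matches them against a signature table. It is only a sketch under assumptions: the function names are hypothetical, and the single entry in `SIGNATURES` is a placeholder rather than the real banner of any tool; actual signatures would have to be collected from sample stego files.

```python
import struct
from typing import Dict, List

# Placeholder signature table (assumption): comment substring -> tool name.
SIGNATURES: Dict[bytes, str] = {b"example-encoder-banner": "HypotheticalTool"}


def jpeg_comments(data: bytes) -> List[bytes]:
    """Return the payload of every COM segment found before the scan data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    comments: List[bytes] = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break                      # malformed stream: stop rather than guess
        marker = data[pos + 1]
        if marker == 0xFF:             # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):     # start of scan or end of image
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if length < 2:
            break                      # length field counts its own two bytes
        if marker == 0xFE:
            comments.append(data[pos + 4:pos + 2 + length])
        pos += 2 + length
    return comments


def identify_by_header(data: bytes) -> List[str]:
    """List the tools whose signature appears in any comment segment."""
    return [tool
            for comment in jpeg_comments(data)
            for sig, tool in SIGNATURES.items()
            if sig in comment]
```

A file read with `open(path, "rb").read()` can be passed directly to `identify_by_header`. An empty result only means that no known comment signature was found; it says nothing about embedding methods that leave the header untouched.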

Another way to improve stego method identification is to separate images by scene; e.g.,
images of an aircraft against blue sky should not be in the same data set as images of an
individual smiling. By separating images into such categories, the problems with outliers
encountered in Section 4.4 can be avoided. Handling the number of varying scenes is a
research topic that has been extensively studied and can be incorporated into the work
provided in this document.

5.3  Conclusion 

This dissertation proposes a novel multi-class classification system for steganalysis. The
development of this system contributes four advancements: feature generation, feature
ranking, multi-class classification for kernel Fisher’s discriminant and support vector
machines, and fusion of detection systems. First, the new features are generated from the
frequency bands and directions of the Discrete Cosine Transform (DCT) coefficients of JPEG
images. The second improvement is a new feature ranking method which, from the original
input feature set, selects a subset of features specifically designed for the kernel
Fisher’s discriminant (KFD) and the support vector machines (SVM). The third improvement
is a multi-class classification tree designed for the KFD and SVM classifiers. The final
contribution of this steganalysis classification system is a multi-class classifier fusion
that combines classifier selection and fusion. The complete system shows an increase in
classification accuracy of 10% and is statistically significantly different from existing
detection techniques. In addition, this system provides a solution for identifying
steganographic fingerprints as well as the ability to include future multi-class
classification tools.
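
The multi-class classification tree summarized above can be pictured with a minimal sketch in which every internal node is a two-class decision and every leaf is a class label. This is an illustration under assumptions only: the actual tree layout is the one defined in Section 3.3, and the node rules below are hypothetical thresholds on a single made-up feature standing in for trained two-class KFD or SVM models. The choice and order of the class labels is likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Union

Features = Sequence[float]


@dataclass
class Node:
    """Internal node: a two-class decision that routes a sample left or right."""
    decide: Callable[[Features], bool]  # True -> follow left, False -> follow right
    left: Union["Node", str]
    right: Union["Node", str]


def classify(tree: Union[Node, str], x: Features) -> str:
    """Walk from the root until a leaf (a class label string) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.left if node.decide(x) else node.right
    return node


# Hypothetical stand-ins for trained two-class classifiers (assumed thresholds).
tree = Node(
    decide=lambda x: x[0] < 0.2,        # Clean vs. any stego class
    left="Clean",
    right=Node(
        decide=lambda x: x[0] < 0.6,    # one stego method vs. another
        left="F5",
        right="OutGuess",
    ),
)
```

With this structure, a sample passes through at most as many two-class decisions as the tree is deep, and adding a new embedding method amounts to adding a node and a leaf.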

Appendix A


Table A1. Number of features used from each feature generation method in feature
combination (Table 4.15 detail for Clean vs. F5, Clean vs. JPH and Clean vs. JS).

Number of features per method:

Method                   Clean vs. F5   Clean vs. JPH   Clean vs. JS
Total No. of Features    11             18              5
Wavelets                 1              3               0
DCT                      5              12              4
DCT Decomp               5              3               1

Selected features, listed as (number of features) description of feature:

Wavelets (description: statistic calculated, orientation, subband scale (level), either
Wavelet Coefficients or Log Error)
  Clean vs. F5:
    (1) Variance, Horizontal, 1, Log Error
  Clean vs. JPH:
    (1) Mean, Diagonal, 1, Log Error
    (1) Variance, Horizontal, 1, Log Error
    (1) Kurtosis, Vertical, 1, Log Error
  Clean vs. JS:
    (0)

DCT (description: either Global Histogram, AC Histogram, Dual Histogram, Variation,
Blockiness, Co-occurrence or Markov)
  Clean vs. F5:
    (1) AC Histogram
    (3) Dual Histogram
    (1) Markov
  Clean vs. JPH:
    (3) AC Histogram
    (3) Dual Histogram
    (5) Co-occurrence
    (1) Markov
  Clean vs. JS:
    (3) Co-occurrence
    (1) Markov

DCT Decomp (description: statistic calculated, orientation, frequency, either Regression,
Mean Difference, Avg. Neighboring, Shifted Right, Shifted Down or Shifted Diagonal)
  Clean vs. F5:
    (1) Variance, Diagonal, Low, Regression
    (1) Variance, Horizontal, Low, Avg. Neighboring
    (1) Variance, Horizontal, Low, Shifted Diagonal
    (1) Variance, Vertical, Low, Shifted Diagonal
    (1) Entropy, Vertical, Low, Shifted Diagonal
  Clean vs. JPH:
    (1) Variance, Diagonal, Low, Regression
    (1) Entropy, Horizontal, Low, Avg. Neighboring
    (1) Variance, Vertical, Low, Shifted Diagonal
  Clean vs. JS:
    (1) Variance, Diagonal, Medium, Shifted Diagonal

Table A2. Number of features used from each feature generation method in feature
combination (Table 4.15 detail for Clean vs. MB, Clean vs. MB 1.2 and Clean vs. OG).

Number of features per method:

Method                   Clean vs. MB   Clean vs. MB 1.2   Clean vs. OG
Total No. of Features    6              10                 5
Wavelets                 0              0                  0
DCT                      5              5                  4
DCT Decomp               1              5                  1

Selected features, listed as (number of features) description of feature:

Wavelets (description: statistic calculated, orientation, subband scale (level), either
Wavelet Coefficients or Log Error)
  Clean vs. MB:
    (0)
  Clean vs. MB 1.2:
    (0)
  Clean vs. OG:
    (0)

DCT (description: either Global Histogram, AC Histogram, Dual Histogram, Variation,
Blockiness, Co-occurrence or Markov)
  Clean vs. MB:
    (4) Co-occurrence
    (1) Markov
  Clean vs. MB 1.2:
    (2) Co-occurrence
    (3) Markov
  Clean vs. OG:
    (1) AC Histogram
    (3) Markov

DCT Decomp (description: statistic calculated, orientation, frequency, either Regression,
Mean Difference, Avg. Neighboring, Shifted Right, Shifted Down or Shifted Diagonal)
  Clean vs. MB:
    (1) Variance, Diagonal, Low, Regression
  Clean vs. MB 1.2:
    (1) Variance, Diagonal, Low, Regression
    (1) Variance, Horizontal, Low, Avg. Neighboring
    (1) Variance, Diagonal, Low, Mean Difference
    (1) Variance, Vertical, Low, Shifted Diagonal
    (1) Entropy, Vertical, Low, Shifted Diagonal
  Clean vs. OG:
    (1) Mean, Vertical, Medium, Regression

Table A3. Number of features used from each feature generation method in feature
combination (Table 4.15 detail for Clean vs. STN, Clean vs. SH and Clean vs. UTSA).

Number of features per method:

Method                   Clean vs. STN   Clean vs. SH   Clean vs. UTSA
Total No. of Features    15              7              5
Wavelets                 2               0              0
DCT                      7               5              5
DCT Decomp               6               2              0

Selected features, listed as (number of features) description of feature:

Wavelets (description: statistic calculated, orientation, subband scale (level), either
Wavelet Coefficients or Log Error)
  Clean vs. STN:
    (1) Variance, horizontal subband at scale 1, log error
    (1) Mean, diagonal subband at scale 1
  Clean vs. SH:
    (0)
  Clean vs. UTSA:
    (0)

DCT (description: either Global Histogram, AC Histogram, Dual Histogram, Variation,
Blockiness, Co-occurrence or Markov)
  Clean vs. STN:
    (1) Global Histogram
    (1) AC Histogram
    (3) Dual Histogram
    (2) Co-occurrence
  Clean vs. SH:
    (4) Co-occurrence
    (1) Markov
  Clean vs. UTSA:
    (4) Co-occurrence
    (1) Markov

DCT Decomp (description: statistic calculated, orientation, frequency, either Regression,
Mean Difference, Avg. Neighboring, Shifted Right, Shifted Down or Shifted Diagonal)
  Clean vs. STN:
    (1) Variance, Diagonal, Low, Regression
    (1) Variance, Horizontal, Low, Avg. Neighboring
    (1) Variance, Horizontal, Low, Shifted Diagonal
    (1) Variance, Vertical, Low, Shifted Diagonal
    (1) Variance, Diagonal, Low, Mean Difference
    (1) Entropy, Vertical, Low, Shifted Diagonal
  Clean vs. SH:
    (1) Variance, Horizontal, Low, Avg. Neighboring
    (1) Variance, Vertical, Low, Shifted Diagonal
  Clean vs. UTSA:
    (0)